When we first put the Hewlett-Packard Journal on the Web in May 1994, our online issues
were offered in PostScript format and designed for the Mosaic browser. Today, we have a 
webmaster, our online issues live in PDF (Portable Document Format) files, we have a nice 
home page (see below), and we have many more features that take advantage of the latest 
Web technology. 

This month we are adding three new features to our site, and we thought it would be helpful to 



• Every two months Current Issue is
updated with the latest issue. You can
find the latest issue in Current Issue
before it arrives in your mailbox.

• We have issues back to February 
1994 in Past Issues. Since 1996 we 
have been including links to other HP 
sites containing relevant information 
about the products, research areas, 
or processes described in each 
issue. 

• To search for articles by title, subject, 
product, or author, use Index, which 
contains information about articles 
going back to the first issue of the 
Journal in 1949. Once you find an article, there is an Order button that tells you how to order
the article or issue of interest.

• Non-HP individuals in the U.S. can request a subscription to the Journal in Subscription 
Information. Guidelines for international subscriptions are also contained in this section. 

• To see articles that are not yet available in print, look in Previews. 

• About The Journal contains information on legal disclaimers and submission to the Journal.

• If you want to be notified by e-mail when a new Journal issue is available, fill in the form in 
E-Mail Notification. 

This month we will celebrate the 25th anniversary of the HP 35 handheld calculator by offering 
online the original Journal issue, published in June 1972, which features that successful 
technology. 

Next month we plan to offer a new look to a previously published article titled "Measuring
Parasitic Capacitance and Inductance Using TDR," by David J. Dasher. Using animation,
several illustrations featuring the propagation of waveforms will be recreated. Whenever 
possible, we hope to use current Web tools to help describe complex technologies. 
In this Issue



In the last couple of years we have had the opportunity to chronicle the evolution
of design and development efforts associated with HP's newest microprocessors
based on the PA-RISC architecture. The HP PA 8000 and PA 8200 microprocessors
are the latest entries in this continuing evolution. The HP PA 8000 is the first HP
processor to implement PA-RISC 2.0 and the first capable of 64-bit operation.
Among the features included in the HP PA 8000 are four-way superscalar operation
and mechanisms for out-of-order execution, which maximize instruction-level
parallelism. The article on page 8 provides a brief overview of PA-RISC 2.0
and describes the key architectural features, implementation details, and system
performance attributes of these new microprocessors.

Like all processor designs, design for the HP PA 8000 microprocessor involved a series of trade-offs 
between die area, complexity, performance, speed, power use, and design time. The article on page 16
discusses these trade-offs and the design methodologies used for the HP PA 8000 processor. 

Because the advanced-microarchitecture PA 8000 microprocessor has so many new features, func- 
tional verification to identify defects that might cause the microprocessor to deviate from its specified 
behavior was quite a challenge. The article on page 22 describes the process and the tools involved in 
functional verification for the HP PA 8000 microprocessor. 

Once it is verified that a processor will perform according to its specifications, the next step is to char- 
acterize its behavior when it is pushed beyond its normal operating conditions. This process is called 
electrical verification, and its use for the HP PA 8000 is described on page 32. The article describes how 
shmoo plots are used to help analyze the results of varying different parameters, such as voltage and 
temperature, and the debugging effort that follows the discovery of an anomaly during shmoo testing. 
The layout of the interconnect metal for the HP PA 8000 required some new block routing technologies. 
These technologies are embodied in a tool called PA_Route, which is described on page 40. 

Telephone service today is more than just the transport of speech information some distance over tele- 
phone lines. Advancements in communications technology and deregulation in the telecommunications 
industry have meant the presence of more service providers competing to offer a wider range of ser- 
vices other than just voice transport. As a result of all these changes telephone networks have to be 
more "intelligent" than they were in the past. The articles starting on page 46 describe the HP OpenCall 
product, which is a collection of computer-based telecommunications platforms designed to offer a 
foundation for telephony services based on intelligent networks. The advanced telephony services offered 
today are carried on a separate signaling network from the voice transmission. The article on page 58 
describes the HP OpenCall SS7 platform, which allows customers to build signaling applications con- 
nected to the SS7 (Signaling System #7) signaling network. System reliability is something that customers 
connected to large-scale networks take for granted. The article on page 65 discusses the active/standby
feature provided in HP OpenCall for achieving fault tolerance and high availability.

Because modern chemical analysis laboratories are so packed with instrumentation and other parapher- 
nalia, an instrument that provides some space economy is a big plus. The article on page 72 describes 
the first benchtop inductively coupled plasma mass spectrometer, the HP 4500. This instrument is one 
fifth the size of previous models and is small and light enough to be installed on an existing bench. The 
HP 4500 has a new type of optics system which allows the instrument to perform analysis down to the 
subnanogram-per-liter or parts-per-trillion (ppt) level. The application areas for the HP 4500 include the 
semiconductor industry, environmental studies, laboratory research, and plant quality control. 

Another essential aspect of a chemical analysis laboratory is the collection of data. With the array of 
instruments creating data and the requirements of many regulatory agencies, data collection in labora- 
tories has become quite critical. Fortunately, many of today's laboratory instruments are automated and 
connected to computer systems, making data collection a little easier. The problem is how to organize 
and store this data. The article on page 80 describes an object database management system that is 
used in the HP ChemStudy product for archiving and retrieving large amounts of complex historical 
laboratory data. The article describes how historical data is managed and the mechanisms provided in
the object DBMS for managing this data. 

One of the features of Asynchronous Transfer Mode (ATM) network technology is that it can satisfy the
quality-of-service needs of many different types of network traffic. To provide this level of service, the
ATM network must avoid network congestion, which causes unacceptable delays and data loss. Policing
the network is one of the key mechanisms used by ATM to avoid congestion. Policing is responsible for
monitoring the network to find connections that could cause congestion. If such a connection is found, policing
can discard traffic from that connection. Given the importance of policing, it is essential that the equipment
responsible for doing the policing be thoroughly tested. The HP E4223A (page 90) is an application
that is designed to test policing implementations in ATM switches before the switches are deployed for 
commercial service. The article describes network policing and explains how the HP E4223A works to 
test policing in ATM switches. 

The articles starting on page 96 are the last papers we have from HP's Design Technology Conference
of 1996. The first paper explores the concept of using MOSFET scaling parameters, such as channel
length and gate oxide thickness, to extrapolate scaling parameters for future MOSFET devices. The
paper on page 101 discusses using clock dithering as an on-chip technique to reduce EMI. The paper
surveys information from organizations inside and outside HP that have used clock dithering and
frequency modulation as an EMI reduction technique. The next paper (page 107) describes a project in
which a third-party microprocessor design was ported via its hardware description language (HDL)
specification instead of the traditional artwork port. This approach has the advantage of allowing the
processor to be optimized for HP's design process. The paper on page 114 describes circuit design
techniques and design trade-offs that were employed to design a 3V operational amplifier in the HP CMOS14
process. The last paper (page 121) analyzes the effects of lids on heat transfer in flip-chip packages.
The results from this analysis showed that although a lidless design shows better performance, more 
research is needed. 

C.L. Leath
Managing Editor 



Cover 

The cover shows the four-way superscalar HP PA 8000 microprocessor. 



What's Ahead 

Since there will not be an October issue of the HP Journal, December 1997 is the next publication date. 
The December issue will feature a new design for the HP Journal and 12 articles on a very timely topic: 
high-speed network communications. The first two articles will discuss the future of this technology and
its impact on society. The remaining articles will focus on HP's R&D efforts in this area, particularly fiber
optics. The February 1998 issue will feature six more high-speed network communications articles.
These articles will focus on wireless communications.
Four-Way Superscalar PA-RISC Processors



The HP PA 8000 and PA 8200 PA-RISC CPUs feature an aggressive 
four-way superscalar implementation, speculative execution, and 
on-the-fly instruction reordering. 

by Anne P. Scott, Kevin P. Burkhart, Ashok Kumar, Richard M. Blumberg,
and Gregory L. Ranson 



The HP PA 8000 and PA 8200 PA-RISC CPUs are the first
implementations of a new generation of microprocessors
from Hewlett-Packard. The PA 8000 is among the world's
most powerful and advanced microprocessors, and at the
time of its introduction in January 1996 it was the undisputed
performance leader. The PA 8200, introduced in June 1997,
continues this performance leadership with higher frequency,
larger caches, and several other enhancements. Both
processors feature an aggressive four-way superscalar
implementation, combining speculative execution with on-the-fly
instruction reordering. This paper discusses the objectives
for the design of these processors, some of the key
architectural features, implementation details, and system
performance. The operation of the instruction reorder buffer
(IRB), which provides out-of-order execution capability,
will also be described.

PA 8000 Design Objectives 

The primary design objective for the PA 8000 was to obtain 
industry-leading performance on a broad range of real-world 
applications. To sustain high performance on large applica- 
tions, not just on benchmarks, we designed large, external 
primary caches with the ability to hide memory latency in
hardware. We also chose to implement dynamic instruction
reordering in hardware to maximize the instruction-level
parallelism available to the execution units. Another goal
was to provide full support for 64-bit applications. The
processor implements the new PA-RISC 2.0 architecture, which
is a binary compatible extension of the previous PA-RISC
architecture. All previous code will execute without
recompilation or translation. The processor also provides glueless
support for up to four-way multiprocessing via a high-bandwidth
Runway system bus. The Runway bus is a 768-Mbyte/s
split-transaction bus that allows each processor to have
several outstanding memory requests.

PA-RISC 2.0 Enhancements

The new PA-RISC 2.0 architecture incorporates a number of 
advanced microarchitectural enhancements. Most of the 
extensions involve support for 64-bit computing. Integer 
registers and functional units, including the shift/merge 
units, have been widened to 64 bits. Flat virtual addressing
up to 64 bits is supported, as are physical addresses greater
than 32 bits (40 bits were implemented on the PA 8000). A
new mode bit has been implemented that governs address
formation, creating increased flexibility. In 32-bit addressing
mode, it is still possible to take advantage of 64-bit compute
instructions for faster throughput. In 64-bit addressing
mode, 32-bit instructions and conditions are still available
for backwards compatibility.

Other extensions help optimize performance in the areas 
of virtual memory and cache management, branching, and 
floating-point operations. These include fast TLB (transla- 
tion lookaside buffer) insert instructions, load and store 
instructions with 16-bit displacement, memory prefetch
instructions, support for variable-sized pages, half-word
instructions for multimedia support, branches with 22-bit
displacements and short pointers, branch prediction hinting,
floating-point multiply-accumulate instructions, floating-point
multiple compare result bits, and other carefully
selected features.

Hardware Design 

The PA 8000 features a completely redesigned core that does
not leverage any circuitry from previous-generation HP
processors. This break from previous CPUs allowed us to
include new microarchitectural features we deemed necessary
for higher performance. Fig. 1 is a functional block diagram
of the processor showing the basic control and data paths.

The most notable feature of the chip, illustrated in the center
of the diagram, is the industry's largest instruction reorder
buffer of 56 entries, which serves as the central control unit.
This block supports full register renaming for all instructions
in the buffer, and tracks interdependencies between
instructions to allow data flow execution through the entire window.

The PA 8000 features a peak execution rate of four
instructions per cycle, made possible by a large complement of
computational units, located on the left side of the diagram.
For integer operation, two 64-bit integer ALUs and two
64-bit shift/merge units are included. All integer functional
units have a single-cycle latency. For floating-point
applications, dual floating-point multiply and accumulate (FMAC)
units and dual divide/square root units are included. The
FMAC units are optimized for performing the very common
operation A times B plus C. By fusing an add to a multiply,
each FMAC can execute two floating-point operations in just
three cycles. In addition to providing low latency for floating-point
operations, the FMAC units are fully pipelined so that
the peak floating-point throughput of the PA 8000 is four
floating-point operations per cycle. The two divide/square
root units are not pipelined, but other floating-point operations
can be executed on the FMAC units while the divide/square
root units are busy. A single-precision divide or square
root operation requires 17 cycles, while double precision
requires 31 cycles.

Fig. 1. Functional block diagram of the HP PA 8000 processor.
FMAC = Floating-Point Multiply and Accumulate
ALU = Arithmetic Logic Unit
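
The throughput figures in this paragraph follow from simple arithmetic. The short Python sketch below is only an illustration of that arithmetic: the function names are hypothetical, and the clock rate is left as a caller-supplied parameter rather than taken from the text.

```python
# Illustrative arithmetic only; function names are hypothetical.
FMAC_UNITS = 2        # dual FMAC units (from the text)
FLOPS_PER_FMAC = 2    # a fused multiply-add counts as two floating-point operations


def peak_flops_per_cycle(units=FMAC_UNITS, flops_per_op=FLOPS_PER_FMAC):
    """Fully pipelined units can each start one fused operation per cycle."""
    return units * flops_per_op


def peak_mflops(clock_mhz):
    """Peak floating-point rate in MFLOPS for a given clock in MHz."""
    return peak_flops_per_cycle() * clock_mhz


def divide_throughput_per_cycle(latency_cycles, units=2):
    """Non-pipelined units: each unit finishes one operation per `latency_cycles`."""
    return units / latency_cycles


if __name__ == "__main__":
    print(peak_flops_per_cycle())            # operations per cycle
    print(divide_throughput_per_cycle(17))   # single-precision divides per cycle
    print(divide_throughput_per_cycle(31))   # double-precision divides per cycle
```

For example, `peak_mflops(180)` gives the peak rate for a part clocked at 180 MHz. Sustained rates depend on how well the reorder buffer keeps both FMAC units fed.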

Having such a large array of computation units would be
pointless if those units could not be supplied with enough
data upon which to operate. To this end, the PA 8000
incorporates two complete load/store pipes, including two address
adders, a 96-entry dual-ported TLB, and a dual-ported cache.
The right side of Fig. 1 shows the dual load/store units and 
the memory system interface. The symmetry of dual func- 
tional units throughout the processor allows a number of 
simplifications in the data paths, the control logic, and signal 
routing. In effect, this duality provides for separate even and 
odd machines. 

As pipelines get deeper and the parallelism of a processor 
increases, instruction fetch bandwidth and branch prediction 
become increasingly important. To increase fetch bandwidth 
and mitigate the effect of pipeline stalls for branches
predicted to be taken, the PA 8000 incorporates a 32-entry
branch target address cache, or BTAC. This unit is a fully
associative structure that associates the address of a branch
instruction with the address of its target. Whenever a branch
predicted to be taken is encountered in the instruction
stream, an entry is created in the BTAC for that branch.
The next time the fetch unit fetches from the address of the
branch, the BTAC signals a hit and supplies the address of
the branch target. The fetch unit can then immediately fetch
the target of the branch without incurring any penalty,
resulting in a zero-state taken branch penalty for branches
that hit in the BTAC. In an effort to improve the hit rate, only
branches predicted to be taken are kept in the BTAC. If a
branch hits in the BTAC but is predicted not to be taken, the
entry is deleted.
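
The BTAC behavior described above can be modeled in a few lines of Python. This is a behavioral sketch only, not the hardware design: the class and method names are hypothetical, and the replacement policy (oldest entry evicted first) is an assumption, since the text does not say how a full BTAC chooses a victim.

```python
from collections import OrderedDict


class BTAC:
    """Behavioral model of a fully associative branch target address cache."""

    def __init__(self, entries=32):
        self.entries = entries          # 32 entries, per the text
        self.table = OrderedDict()      # branch address -> target address

    def lookup(self, fetch_addr):
        """Return the cached target on a hit, or None on a miss."""
        return self.table.get(fetch_addr)

    def update(self, branch_addr, target_addr, predicted_taken):
        """Keep only branches that are predicted to be taken."""
        if predicted_taken:
            if branch_addr not in self.table and len(self.table) >= self.entries:
                # ASSUMPTION: evict the oldest entry; the real policy is not stated.
                self.table.popitem(last=False)
            self.table[branch_addr] = target_addr
        else:
            # A branch predicted not taken is removed if it was present.
            self.table.pop(branch_addr, None)
```

A fetch unit model would call `lookup` on every fetch address and redirect immediately on a hit, which is what removes the taken-branch penalty for branches resident in the structure.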

To reduce the number of mispredicted branches, the PA 8000 
implements two modes of branch prediction: dynamic mode 
and static mode. Each TLB entry has a bit to indicate which
prediction mode to use. Thus, the mode is selectable on a
page-by-page basis. In dynamic prediction mode, a 256-entry
branch history table, or BHT, is consulted. The BHT stores
the results of the last three iterations of each branch (either
taken or not taken), and the instruction fetch unit predicts
that the outcome of a given branch will be the same as the
majority of the last three outcomes. In static prediction mode,
the PA 8000 predicts most conditional forward branches to
be untaken, and most conditional backward branches to be
taken. For the common compare-and-branch instruction, the
PA-RISC 2.0 architecture defines a branch prediction bit that
indicates whether this normal prediction convention should 
be followed or whether the opposite convention should be 
used. Compilers using either heuristic methods or profile- 
based optimization can use static prediction mode to com- 
municate branch probabilities effectively to the hardware. 
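
The two prediction modes can be sketched in Python as follows. This is a simplified model under stated assumptions: the table index (word address modulo table size), the initial history (all not taken), and every name in the code are hypothetical, and the static rule ignores the "most" qualifier in the text by treating all conditional branches alike.

```python
class BranchHistoryTable:
    """Dynamic mode: predict the majority of the last three outcomes."""

    def __init__(self, entries=256):
        self.entries = entries
        # ASSUMPTION: histories start as three "not taken" outcomes.
        self.history = [[False, False, False] for _ in range(entries)]

    def _index(self, branch_addr):
        # ASSUMPTION: index by word address modulo table size.
        return (branch_addr >> 2) % self.entries

    def predict(self, branch_addr):
        """Return True if at least two of the last three outcomes were taken."""
        return sum(self.history[self._index(branch_addr)]) >= 2

    def record(self, branch_addr, taken):
        """Shift the newest outcome in and the oldest outcome out."""
        entry = self.history[self._index(branch_addr)]
        entry.pop(0)
        entry.append(bool(taken))


def static_predict(branch_addr, target_addr, hint_bit=False):
    """Static mode: backward taken, forward not taken; the hint bit inverts it."""
    backward = target_addr < branch_addr
    return backward != hint_bit
```

A loop-closing backward branch is therefore predicted taken in static mode unless the compiler sets the hint bit, while in dynamic mode a single anomalous outcome does not flip the prediction because two of the three recorded outcomes still agree.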

Cache Design 

The PA 8000 features large, single-level, off-chip, direct-mapped
instruction and data caches. Both caches support
configurations of up to four megabytes using industry-standard
synchronous SRAMs. Two complete copies of the data cache
tags are provided so that two independent accesses can be
accommodated and need not be to the same cache line.

Why did we design the processor without on-chip caches? 
The main reason is performance. Competing designs incorpo- 
rate small on-chip caches to enable higher clock frequencies. 
Small on-chip caches support benchmark performance but 
fade on large applications, so we felt we could make better 
use of the die area. The sophisticated IRB allows us to hide 
the effects of a pipelined two-state cache latency. In fact, 
our simulations demonstrated only a 5% performance im- 
provement if the cache were on-chip and had a single-cycle 
latency. The flat cache hierarchy also eliminates the design 
complexity associated with a two-level cache design.

Chip Statistics 

The PA 8000 is fabricated in HP's 0.5-micrometer, 3.3-volt 
CMOS process. Although the drawn geometries are not very
aggressive, we still obtain a respectable 0.28-µm effective
channel length (Leff). In addition, extensive investment was
made in the design process to ensure that both layout and
circuits would scale easily into more advanced technologies
with smaller geometries. There are five metal layers: two for
tight pitch routing and local interconnect, two for low-RC
global routing, and a final layer for clock and power supply
routing.

The processor is designed with a three-level clock network, 
organized as a modified H-tree (see article, page 16). The
clock sync signals serve as primary inputs. They are
received by a central buffer and driven to twelve secondary
clock buffers located in strategic spots around the chip.
These buffers then drive the clock to the major circuit areas,
where it is received by clock gaters featuring high gain and
a very short input-to-output delay. There are approximately
7,000 of these gaters, which have the ability to generate
many flavors of the clock: two-phase overlapping or
nonoverlapping, inverting or noninverting, qualified or nonqualified.
The qualification of clocks is useful for synchronous
register sets and dumps, as well as for powering down
sections of logic when not in use. Extensive simulation and
timing of the clock network were done to minimize clock
skew and improve edge rates. The final clock skew for this 
design was simulated to be no greater than 170 ps between 
any two points on the die. 

Under nominal operating conditions of room temperature
and 3.3-volt power supplies, the chip is capable of running at
frequencies up to 250 MHz. Although we cannot guarantee
processor performance based on results obtained under
ideal conditions, there appears to be an opportunity for
greater frequency enhancement. The die measures 17.68 mm
by 19.1 mm and contains 3.8 million transistors. Approximately
78% of the chip is either full-custom or semicustom.
A photograph of the die with all major areas labeled is
shown in Fig. 2. Again, the IRB is in the center of the chip,
providing convenient access to all the functional units. The
integer data path is on the left side of the chip, while the
right side contains the floating-point data path.

Fig. 2. PA 8000 CPU with major areas labeled.

By using flip-chip packaging technology, we were able to
support a very large number of I/O signals, 704 in all. In
addition to the I/O signals, 1,200 power and ground solder
bumps are connected to the 1,085-pin package via a land
grid array. There are fewer pins than the total of the I/Os
and bumps because each power and ground pin can be
connected to multiple bumps. A picture of the packaged part is
shown in Fig. 3. The chip is flipped onto the ceramic carrier
using solder bump interconnect, and the carrier is mounted
on a conventional printed circuit board. This packaging has
several advantages. The wide off-chip caches are made
possible by the high-pin-count capability. The ability to place
I/O signals anywhere on the die improves area utilization
and reduces on-chip RC delays. Finally, the low inductance
of the signal and power supply paths reduces noise and
propagation delays.

Fig. 3. Packaged PA 8000 CPU.

Performance 

At 180 MHz with one megabyte of instruction cache and
one megabyte of data cache, the HP PA 8000 delivers over
11.8 SPECint95 and greater than 20.2 SPECfp95, making it
the world's fastest processor at the time of introduction. A
four-way multiprocessor system has also produced 14,739.03
TpmC ($132.25/TpmC), where TpmC is an industry-standard
benchmark for online transaction processing. That system
configuration was made available in June 1996.

Enabling the PA 8000 to achieve this level of performance
are several distinguishing features. First, there are a large
number of functional units (ten, as described previously).
However, multiple units alone are not enough. To sustain 
superscalar operation beyond two-way demands advanced 
instruction scheduling methods to supply a steady stream of 
independent tasks to the functional units. To achieve this 
goal, an aggressive out-of-order execution capability was 
incorporated. The instruction reorder buffer provides a 
large window of available instructions combined with a 
robust dependency tracking system. 

Second, having explicit compiler options to generate hints
to the processor helps a great deal. These special instructions 
can be used to prefetch data and to communicate statically 
predicted branch behavior to the branch history table, as 
described previously. 

Finally, the system bus interface is capable of tracking up to 
ten pending data cache misses, an instruction cache miss, 
and an instruction cache prefetch. Since multiple misses can 
be serviced in parallel, the average performance penalty 
caused by each is reduced.

Instruction Reorder Buffer 

Because of restrictions on compiler scheduling, a key
decision was made to have the PA 8000 perform its own
instruction scheduling. To accomplish this task, the PA 8000 is
equipped with an instruction reorder buffer, or IRB, which
can hold up to 56 instructions. This buffer is composed of
two pieces: the ALU buffer, which can store up to 28
computation instructions, and the MEM (memory) buffer, which can
hold up to 28 load and store instructions. These buffers track
over a dozen different types of interdependencies between
the instructions they contain, and allow instructions
anywhere in the window to execute as soon as they are ready.

As a special feature, the IRB tracks branch prediction
outcomes, and when a misprediction is identified, all
instructions that were incorrectly fetched are flash-invalidated.
Fetching then resumes down the correct path without any
further wasted cycles. 

The IRB serves as the central control logic for the entire
chip, yet consists of only 850,000 transistors and consumes
less than 20% of the die area. A high-performance IRB is of
paramount importance, since today's compilers simply lack
run-time information, which is useful for optimal scheduling.
The reorder buffer on the PA 8000 is 40% larger than that of
the nearest competitor.

Instruction reordering also leads to the solution for another
bottleneck: memory latency. Although the dual load/store
pipes keep the computation units busy as long as the data is
cache-resident, a data cache miss can still cause a disruption.
Execution can continue for many cycles on instructions that
do not depend on the data cache miss. The PA 8000 can
execute instructions well past the load or store that was missed,
since the IRB can hold so many instructions. When useful
work can be accomplished during a data cache miss latency,
the net impact on performance is significantly reduced.

The large window of available instructions also allows over- 
lap of multiple data cache misses. If a second data cache 
miss is detected while an earlier miss is still being serviced
by main memory, the second miss will be issued to the 
system bus as well. 

Life of an Instruction 

A block diagram of the PA 8000's instruction reorder buffer
is shown in Fig. 4. Instructions enter through the sort block
and are routed to the appropriate portion of the IRB based
on instruction type, where they are held until they retire. The
functional units are connected to the appropriate section of
the IRB based on what types of instructions they execute.
After execution, instructions are removed from the system
through the retire block.

Instruction Insertion. The IRB must be kept as full as possible
to maximize the chances that four instructions are ready to
execute on a given cycle. A high-performance fetch unit was
designed to maximize IRB occupancy. This unit fetches, in
program order, up to four instructions per cycle from the
single-level off-chip instruction cache.

Limited predecode is then performed, and the instructions
are inserted in a round-robin fashion into the appropriate
IRB. Each IRB segment must be able to handle four incoming
instructions per cycle, since there are no restrictions on the
mix of instructions being inserted.
There are several special cases. Branches, although executed
from the ALU IRB, are also stored in the MEM IRB as a
placeholder to indicate which entries to invalidate after a
mispredicted branch. Instructions that have both a
computation and a memory component and two targets, such as the
load word and modify (LDWM) instruction, are split into two
pieces and occupy an entry in both portions of the IRB.

Instruction Launch. Instructions are allowed to execute out of 
order. During every cycle, both segments of the IRB allow 
the oldest even and the oldest odd instruction for which all
operands are available to execute on the functional units.
Thus, up to four instructions can be executed at once: two
computation instructions and two memory reference
instructions. Once an instruction has been executed, its result
is held in a temporary rename register and made available
for use by subsequent instructions.
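
The launch rule can be illustrated with a small Python model. This is a sketch under assumptions, not the IRB's actual selection logic: the `Entry` fields, the function names, and the idea that the even or odd machine is chosen by the parity of the entry's slot number are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    seq: int                # program-order sequence number (smaller is older)
    slot: int               # ASSUMPTION: slot parity selects the even/odd machine
    ready: bool             # True once all operands are available
    launched: bool = False


def select_for_launch(segment):
    """Pick the oldest ready even entry and the oldest ready odd entry."""
    picks = []
    for parity in (0, 1):
        candidates = [e for e in segment
                      if e.ready and not e.launched and e.slot % 2 == parity]
        if candidates:
            picks.append(min(candidates, key=lambda e: e.seq))
    return picks


def launch_cycle(alu_segment, mem_segment):
    """One cycle: up to two ALU-side and two MEM-side entries are launched."""
    launched = select_for_launch(alu_segment) + select_for_launch(mem_segment)
    for entry in launched:
        entry.launched = True
    return launched
```

Because each segment contributes at most one even and one odd entry, no cycle launches more than four instructions, and an older instruction that is still waiting on operands does not block younger ready ones.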

Instruction Retire. Instructions are removed or retired from 
the IRB in program order once they have executed and any 
exceptions have been detected. Enforcing strict retirement 
order provides software with a precise exception model. As 
instructions are retired, the contents of the rename registers 
are transferred to the general registers, stores are placed in 
a queue to be written to cache, and instruction results are 
committed to the architected state. The retire unit can handle
up to two ALU or floating-point instructions and up to two
memory instructions each cycle. 
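
In-order retirement with per-cycle limits can be sketched as below. The model is a simplification under assumptions: it keeps a single program-order window of dictionaries with hypothetical `kind` and `done` keys, whereas the actual IRB holds ALU and memory instructions in separate buffers.

```python
from collections import deque


def retire_cycle(window, max_alu=2, max_mem=2):
    """Retire completed instructions from the head of `window` in program order.

    `window` is a deque of dicts with keys 'kind' ('alu' or 'mem') and 'done'.
    Retirement stops at the first unfinished instruction or when a limit is hit.
    """
    retired = []
    counts = {"alu": 0, "mem": 0}
    limits = {"alu": max_alu, "mem": max_mem}
    while window and window[0]["done"]:
        kind = window[0]["kind"]
        if counts[kind] == limits[kind]:
            break                      # per-cycle limit for this class reached
        counts[kind] += 1
        retired.append(window.popleft())
    return retired
```

Stopping at the first unfinished instruction is what gives software a precise exception model: nothing younger than a faulting instruction has updated architected state when the exception is taken.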

The HP PA 8200 Processor 

After the successful introduction of PA 8000 processor-based
products, the PA 8000 design team initiated a follow-up
program. Performance analysis on key applications identified
several opportunities for future products. The PA 8200 CPU
team formulated a plan for improvement based on the
following goals set by HP customers and management:
• Improved performance
• Compatibility with existing applications
• Leverage of the PA 8000 design foundation
• Rapid time to market.

Improved Performance. Application trace studies identified
branch prediction, TLB miss rates, and increased cache
sizes as significant opportunities. The availability of
next-generation 4M-bit SRAMs with improved access times
allowed the design team to increase processor clock speed
and double cache size to 2M bytes for both the instruction
cache and the data cache. The faster access time of 4M-bit
SRAMs allowed higher processor clock rates without
changes to the cache access protocol. The combination
of increased clock frequency, larger caches, improvement
of branch prediction accuracy, and reduction of TLB miss
rates enables performance improvements of 15% to 30% on
key applications.

Compatibility with Existing Applications. Follow-on products
using the PA 8200 had to preserve our customers'
investment in PA 7200-based and PA 8000-based software and
hardware. It was considered essential to maintain binary
compatibility with existing PA-RISC applications and pro- 
vide an upgrade path for improved performance. 



Leverage of PA 8000. The PA 8200 design team leveraged the
extensive functional and electrical verification results
accumulated during the prototyping phase of the PA 8000
development. A wealth of design data is collected in the process of
turning a design into a product. This information identified
the paths limiting CPU operating speed and the performance
limiters in the branch and TLB units. Characterization of the
PA 8000 cache design provided the basis for a new design
using high-speed 4M-bit SRAMs.

Rapid Time to Market. The competitive situation dictated that
speed upgrades to the PA 8000 were needed to maintain
HP's performance leadership in the high-performance
workstation and midrange server markets. Therefore, design
changes and characterization of the expanded cache subsystem
had to be completed within a very aggressive schedule.

In the following sections, PA 8200 design changes to the 
PA 8000 processor will be detailed. 

PA 8000 Performance Analysis 

Given the goals of increased performance with low risk and
a short time to market, it was necessary to understand fully
where the PA 8000 excelled and where significant
improvements could be made. Key customer applications were
examined to determine how real-world code streams were
being executed on the PA 8000.

For the PA 8000, the expectation was set that no code
recompilation would be necessary to see a 2× speedup over
the PA 7200. We did not want to change this expectation for
the PA 8200, so all code experiments were performed using
nonrecompiled, nontuned code. It was shown that the
PA 8200's performance could be enhanced significantly over
that of the PA 8000 by reducing the amount of time the
PA 8200 spent waiting for instructions or data. The branch
history table (BHT) and translation lookaside buffer (TLB)
are architectural features that are intended to reduce wasted
cycles resulting from penalties, particularly in pipelined
machines. For mispredicted branches, TLB misses, and
cache misses, the number of penalty cycles increased from
the PA 7200 to the PA 8000. It was expected that a
corresponding reduction in mispredictions and misses and the
ability to hide penalty cycles using out-of-order execution
would result in an overall decrease of wasted cycles. The
analysis of the application suite showed otherwise, as the
number of wasted cycles increased from the PA 7200 to the
PA 8000, accounting for 30 to 86 percent of the total number
of cycles spent on each instruction (CPI). If the number of
mispredictions and misses could be decreased, a significant
performance boost would be realized. As a result, increases
in the size of the BHT, TLB, and caches were examined as
potential high-benefit, low-risk improvements to the PA 8000.

BHT Improvement 

The biggest performance weakness observed was the
mispredicted branch penalty. By its nature, out-of-order
execution increases the average penalty for mispredicted branches.
Therefore, significant design resources were allocated for
the PA 8000's branch prediction scheme to lower the
misprediction rate, thereby offsetting the higher penalty. The results
of the performance analysis revealed that cycles wasted
because of branch penalties were still significantly impacting
performance. Relative to the PA 7200, the misprediction rate
is generally about 50% lower across the sample workload of
technical applications. However, the cycle penalty for a
mispredicted branch rose by 200%, more than offsetting the
reduction in miss rate. There are clearly two possible
solutions: decreasing the miss rate or decreasing the miss penalty.
Because of the short time schedule of the program, redefining
how mispredicted branches are handled to reduce the penalty
was not a viable alternative. The more practical solution was
to improve branch prediction accuracy.
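
The trade-off in this paragraph is simple arithmetic: the cycles lost to mispredicted branches scale with the misprediction rate multiplied by the penalty per misprediction. The following is a minimal sketch of that relationship only. The function name and the choice to express both quantities as ratios relative to the PA 7200 are assumptions made for illustration, and the penalty ratio is left as a parameter rather than taken from measured data.

```python
def relative_branch_cost(rate_ratio: float, penalty_ratio: float) -> float:
    """Wasted branch cycles relative to a baseline machine.

    rate_ratio    -- new misprediction rate divided by the baseline rate
    penalty_ratio -- new cycles per misprediction divided by the baseline
    """
    return rate_ratio * penalty_ratio


# Halving the misprediction rate (about 50% lower, as in the text) only
# breaks even if the penalty does no more than double.
print(relative_branch_cost(0.5, 2.0))

# Hypothetical case: the penalty triples, so wasted cycles grow by half
# even though the predictor is twice as accurate.
print(relative_branch_cost(0.5, 3.0))
```

Any result above 1.0 means more cycles are wasted than on the baseline, which is why the remaining lever on a fixed pipeline is a further reduction of the misprediction rate.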

Improvements to the BHT focused on two areas. The first
was the table size and the second was the branch prediction
algorithm. The PA 8000 uses a three-bit majority vote
algorithm and a 256-entry BHT. Since the PA 8000 also allows up
to two branches to retire simultaneously, the table ideally
would be able to update two entries per cycle. Parallel BHT
update was not implemented on the PA 8000, resulting in the
outcome of one of the branches not having its information
entered into the BHT. Analysis of this limitation revealed a
minor penalty that could easily be eliminated in the PA 8200.

Initial investigation for BHT improvements focused on the
size of the table since it is easier to increase the size of an
existing structure than to start from scratch and redefine the
algorithm. To have the minimum impact on control logic, it
was desirable to increase the table size by a multiple of two.
Visual inspection of the area around the BHT revealed that
the number of entries could be increased to 512 with little
impact.

Next, possible changes in the prediction algorithm were
explored. Using a more common algorithm became the key
to allowing the BHT to grow to 1024 entries. The new
algorithm requires only two bits of data compared to the
three-bit algorithm implemented on the PA 8000. Analysis of the
two algorithms showed that they result in almost the same
predictions with only a few exceptions. The reduction in the
number of bits per entry from three to two allowed the BHT
to grow from 512 to 1024 entries. The increase in table size
made possible by the algorithm change was shown through
simulation to provide more of an incremental improvement
than was lost by the switch to the two-bit algorithm.
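
The article does not spell out either algorithm, so the sketch below rests on two assumptions: that "three-bit majority vote" means remembering the last three outcomes of a branch and predicting the majority, and that the "more common" two-bit algorithm is the familiar saturating up/down counter. All class and function names are illustrative. It shows how the two kinds of table entry can agree on a typical loop branch.

```python
class MajorityVote3:
    """Three-bit entry: keeps the last three outcomes, predicts the majority."""

    def __init__(self):
        self.history = [False, False, False]

    def predict(self) -> bool:
        return sum(self.history) >= 2

    def update(self, taken: bool) -> None:
        # Shift out the oldest outcome and shift in the newest one.
        self.history = self.history[1:] + [taken]


class SaturatingCounter2:
    """Two-bit entry: counter in 0..3, predicts taken when it is 2 or 3."""

    def __init__(self):
        self.counter = 0

    def predict(self) -> bool:
        return self.counter >= 2

    def update(self, taken: bool) -> None:
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


def count_correct(entry, outcomes) -> int:
    """Run one branch's outcome stream through a single table entry."""
    correct = 0
    for taken in outcomes:
        if entry.predict() == taken:
            correct += 1
        entry.update(taken)
    return correct


# A loop branch: taken eight times, one loop exit, then taken eight more times.
loop_branch = [True] * 8 + [False] + [True] * 8
print(count_correct(MajorityVote3(), loop_branch))       # 14 of 17
print(count_correct(SaturatingCounter2(), loop_branch))  # 14 of 17
```

On this pattern both entries mispredict only the two warm-up branches and the loop exit, which is consistent with the observation that the two algorithms give almost the same predictions.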

One additional improvement was made to the BHT
concerning the handling of multiple branches retiring at the same
time. Allowing two entries to be updated simultaneously
required the data entries to have two write ports. This
functionality was not included in the PA 8000, so implementing a
two-port solution on the PA 8200 would be very expensive in
die area. Therefore, a control-based solution was devised.
When two branches retire on the same cycle, the information
necessary to update the cache for one of the branches is held
in a one-entry queue. On the next cycle, the data in the queue
is used to update the table. If another branch also retires on
the next cycle, the queue data is written into the BHT and
the newly retiring branch's data is stored in the queue. Only
if two branches retire while the queue contains data is the
data for one branch lost. This condition is considered to be
quite rare, since it requires that multiple pairs of branches
retire consecutively. The rarity of this situation makes the
performance impact of losing the fourth consecutive branch's
data negligible.
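
The queueing rule described above can be sketched as a small cycle-by-cycle simulation. This is an illustration under stated assumptions, not the PA 8200 control logic: the function name, the use of strings to stand for branch update data, and the choice to drain the queue at the end of the run are all hypothetical. The table is modeled as having a single write port per cycle.

```python
def simulate_bht_updates(retire_schedule):
    """Model a single-write-port table fed through a one-entry queue.

    retire_schedule -- one list per cycle holding the 0, 1, or 2 branches
                       retiring on that cycle.
    Returns (written, lost): updates that reached the table, in order,
    and updates that were dropped.
    """
    written, lost = [], []
    queue = None  # the one-entry branch store queue

    for retiring in retire_schedule:
        assert len(retiring) <= 2, "at most two branches retire per cycle"
        pending = list(retiring)

        if queue is not None:
            # The queued update uses this cycle's only write port.
            written.append(queue)
            queue = None
            if pending:
                queue = pending.pop(0)  # first new branch waits its turn
            lost.extend(pending)        # a second new branch has no room
        else:
            if pending:
                written.append(pending.pop(0))  # write one directly
            if pending:
                queue = pending.pop(0)          # hold the other

    if queue is not None:
        written.append(queue)  # drain once retirement stops
    return written, lost


# Two pairs retire back to back: the fourth branch's data is lost.
print(simulate_bht_updates([["A", "B"], ["C", "D"], []]))
# A pair followed by single retirements loses nothing.
print(simulate_bht_updates([["A", "B"], ["C"], ["D"]]))
```

The first schedule reproduces the case called out in the text, where only the fourth consecutive branch's data is dropped.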



The risk involved with making the described changes to the
BHT was relatively low. The data storage elements are
well-understood structures and could be expanded with little risk.
The control for the new BHT could mostly be leveraged
from the PA 8000 implementation with the exception of the
new branch store queue. Significant functional verification
was done to ensure correctness of the new BHT. Since
control and data paths remained almost the same as the old
BHT, there was high confidence that the changes would not
introduce new frequency limiters.

TLB Improvement 

The second major area of improvement involved the TLB.
Relative to the PA 7200, the PA 8000 uses significantly more
cycles handling TLB misses on most of the applications used
to analyze performance. The reason for this increase is
twofold. First, the penalty for a TLB miss increased from
26 cycles on the PA 7200 to 67 cycles on the PA 8000. The
increase in TLB miss penalty was mainly caused by an
increase in control complexity resulting from the
out-of-order capability of the PA 8000. Second, the TLB miss rate
for most of the applications examined also increased. The
total number of entries decreased by 20% from 120 to 96
between the PA 7200 and the PA 8000. However, the PA 8000
has a combined instruction and data TLB while the PA 7200
has separate instruction and data TLBs. At the time, a
decrease in size seemed an acceptable trade-off since
instruction and data TLB entries could now use the entire TLB.

Since the penalty for a TLB miss could not be reduced
without significant redefinition of how a TLB miss is handled,
the number of entries was the area of focus. Simulation
revealed that increasing the number of entries provided a
nearly linear improvement in the TLB miss rate, leveling off
at about 128 entries. In looking at the area the TLB occupied
and the surrounding routing channels, it became clear that
128 entries would involve an unacceptable design risk. Since
the implementation is most efficient in multiples of 8, we
next examined 120 entries. Initial examination of the
artwork showed that this target would be aggressive, yet
reasonable. Simulations were done assuming 128 entries to
provide some additional timing margin and to allow for
increasing to 128 entries if it became possible. Most of the
circuit timing paths were found to have nearly the same
performance with 120 entries as 96 entries since the critical
variable for timing is generally the width of an entry and not
the number of entries. Some minor changes to transistor
sizing provided the additional margin necessary on critical
paths that traversed the TLB array. The goal of these changes
was to increase the number of TLB entries over the PA 8000
without impacting speed.

The biggest risk that the TLB changes posed was to the
project schedule. The area affected by the changes was
much larger than that of any other change, and there were
hard boundaries to other functional units that constrained
design area. To increase the size of the TLB, two complex
signal channels were rerouted. Although necessary to
provide the additional room, the changes were time-consuming
and presented significant schedule risk. Routing changes
also increased the chance of a change in the electrical
performance of the affected signals. To minimize this risk, a
tool was written to verify that signal integrity was not
compromised. Overall, the rerouting of the channels was the
critical path to tape release and also the highest risk.

© Copr. 1949-1998 Hewlett-Packard Co.

August 1997 Hewlett-Packard Journal

Frequency Improvement 

In addition to improving the BHT and TLB performance, the
target frequency for the PA 8200 was increased over that of
the PA 8000. We took a two-pronged approach to timing
analysis. The first approach consisted of analyzing a
software model of PA 8000 timing and the second approach
consisted of examining data from prototype systems in which
we increased the frequency to the failing point.

The PA 8000 timing was modeled using Epic's Pathmill and
Timemill suite and Verilog's Veritime. These tools provided
an ordered set of paths ranked according to predicted
operation frequency. We grouped the data into paths that were
internal to the chip (core paths) and paths that received or
drove information to the cache pins (cache paths). It became
readily apparent that there was a very small set of core paths
and a much larger set of cache paths that could potentially
limit chip frequency. The core paths tended to be
independent of all other core paths and could be improved on an
individual basis within the CPU. The cache path limiters
tended to funnel into a couple of key juncture points and
could be globally improved by addressing those points.
As an additional degree of freedom, cache paths could be
addressed through a combination of CPU, board, and cache
SRAM improvements.

Once it was determined which core paths might limit chip
frequency, we had to devise a method to correlate the
simulated frequency with actual chip performance. Targeted tests
were written to exercise potential core limiters. Paths were
chosen based on their independence from known limiters
and for their ability to be completely controlled by the test.
The targeted tests ran consistently faster on silicon than the
model predicted, giving us confidence that core paths would
not be frequency limiters.

We then looked at correlating cache paths between the model
and the system. Cache paths tend to be multistate paths
dependent on the timing of the cache SRAMs. Because of
these attributes, it was not feasible to craft a chip-level test
to exercise specific cache paths. Therefore, we decided to
rely upon system data for determining worst-case cache
paths and then use the model data to show the frequency of
other cache paths relative to the worst case. System work
revealed two cache path frequency limiters. Both paths
were predicted by and correlated with the timing model.

Based on the cache paths exposed through system work, an
additional timing investigation was launched. Both paths
funnelled into a similar set of circuits to send addresses to
the SRAMs. All other inputs into those circuits were
examined and individually simulated using SPICE to determine if
they had the potential to become frequency limiters. From
this effort, one additional set of inputs was identified as
having a high risk of becoming a frequency limiter once the
known limiters were improved. The proposed improvements
to the known limiters improved the newly identified path as
well, keeping it from becoming a critical path.

The final step taken to understand the frequency limitations
of the PA 8000 was to devise a way to look beyond the known
limiting paths in a system. The lowest-frequency speed limiter
was a cache path related to an architectural feature to
improve performance. On the PA 8000, this feature can be
disabled. However, the second speed limiter was not
programmable and was therefore capable of masking other paths.
We turned to focused ion beam (FIB) technology to help us
solve this problem.

The second speed limiter was a single-phase path that started
with the rising edge of a clock and ended with the falling
edge of a derived clock. By delaying the falling edge of the
derived clock, we could increase the frequency at which the
path could run, creating a new region in which we could
search for failing paths. We used the FIB to cut away and
rebuild the circuitry for the derived clock. In the process of
stripping away the metal on the chip and then redepositing
it to rebuild the circuit, resistance is added, slowing down the
circuit. We were able to add 220 ps to the path, increasing the
failing frequency for this limiter by approximately 22 MHz.
The FIB-modified chip was placed in a system for extensive
testing. No additional failing paths were found in the newly
opened frequency region.

In improving the critical paths for the PA 8200, a conservative 
design approach was adopted. Most of the improvements 
involved moving clock edges, allowing latches to update 
earlier than before. Such changes can expose races or setup 
violations. The paths were carefully simulated to eliminate 
the risk of introducing a race. In cases where it was difficult 
to precisely determine the setup time needed for a signal, 
conservative changes were made. 

Cache Improvement 

Yet another area for improvement on the PA 8200 was the 
cache subsystem. The cache size plays an integral role in 
determining how well the system performs on both applica- 
tions and benchmarks. In addition, the off-chip cache access 
path can limit the operating frequency of the system because 
of the tight coupling between the CPU and the SRAMs. 

The PA 8000 offered a maximum cache size of 1M bytes for
both the instruction and data caches. A total of 20 1M-bit
industry-standard late-write synchronous SRAMs were
employed for this configuration. The printed circuit board
design was cumbersome because of the large number of
SRAM sites. The design resulted in relatively long round-trip
delays. As the PA 8200 was being defined, the next generation
of SRAMs became available. These 4M-bit parts were fully
backwards compatible with those used with the PA 8000.
The emergence of these higher-density components made
possible a 2M-byte instruction cache and a 2M-byte data
cache while reducing the number of SRAMs to 12. The
resulting board layout was more optimal, contributing to
shorter routes and better signal integrity.

In addition to cache size, the frequency limitation of the
off-chip cache was carefully addressed. For much of the
post-silicon verification of the PA 8000, the two-state cache
access presented a frequency barrier that limited the amount
of investigation beyond 180 MHz. Two main contributors
allowed the frequency of the PA 8200 to be increased well
beyond 200 MHz. The first was the new SRAM placement and
routing for the cache subsystem. The 12-SRAM configuration
yielded a new worst-case round-trip delay that was 500 ps
shorter than the 20-SRAM configuration previously used. The
second enabler was linked to the next-generation SRAMs.
Not only did these parts provide four times the density, they
also reduced their access times from 6.7 ns to 5.0 ns. The
combined benefit of these two enablers resulted in raising
the maximum cache-limited frequency from 180 MHz to
290 MHz. The value of this improvement was really twofold.
First, it enabled system-level electrical characterization and
CPU core speed path identification in a space previously
unexplored. Second, it resulted in a manufacturable product
that could meet the performance needs of our workstations.
PA 8200 Performance 

Under nominal operating conditions of room temperature
and 3.3-volt power supplies, the PA 8200 is capable of
running up to 300 MHz, 70 MHz faster than its predecessor.
Table I summarizes its performance.

Table I
HP PA 8200 CPU Performance

Benchmark    Estimated Performance    Frequency
SPECint95    16.1                     230 MHz
SPECfp95     25.5                     230 MHz



Conclusion 

The HP PA 8000 RISC CPU achieved industry-leading
performance across a wide variety of applications by using an
aggressive out-of-order design and carefully balancing
hardware utilization throughout the system. The PA 8200
leverages that design, improving key areas identified by customer
needs and applications. The number of TLB and BHT entries
was increased, chip operating frequency was increased, and
the cache configuration was updated to include the latest
available SRAM technology. Together these changes improved
system performance across customer applications up to
23%, once again delivering industry-leading performance.

Design Methodologies and Circuit
Design Trade-Offs for the HP PA 8000
Processor

This paper discusses the various design methods used in the PA 8000, 
specific design techniques for the new packaging technology, the clock 
distribution scheme, cross-chip signal integrity issues, and some of the 
new tools and techniques. 

by Paul J. Dorweiler, Floyd E. Moore, D. Douglas Josephson, and 
Glenn T. Colon-Bonet 



The increasing demands for greater processor performance
to remain competitive in today's computer market necessi- 
tate careful attention to the methods used in designing 
processors to achieve these performance goals. Processor 
designs are increasing in complexity to meet performance 
goals, with such features as out-of-order execution and super- 
scalar operation. Design cycles are decreasing in length, so 
design quality must increase as well. All of these factors call 
for new design techniques to ensure continued success. 

This paper will present some of the design methodologies
and choices used in the design of the HP PA 8000 CPU, the
first HP processor to implement the PA-RISC 2.0 architecture
and the first capable of 64-bit operation. The various design
methods used in the PA 8000, specific design techniques for
the new packaging technology used, the clock distribution
scheme, and cross-chip signal integrity issues will be
discussed. We will also present some of the new tools and
techniques employed by HP to ensure a high level of quality on
first silicon, based in large part on our experiences with
previous PA-RISC microprocessor designs.

Design Trade-Offs and Methodologies 

Processor design is a continuous series of trade-offs between
die area, complexity, performance, speed, power use, and
design time. Given the complexity of a four-way out-of-order
processor such as the PA 8000, it is not appropriate to employ
the same circuit design techniques for all blocks on the chip.
For the PA 8000, three major circuit design techniques were
used.

The first is the traditional static design approach, in which
all output signals are held true as long as the inputs to the
static cell remain constant. Storage of values, or state, is in
latches, and logic functions are implemented using a variety
of different logic blocks, allowing minimization of area or
path evaluation time. Since static logic is fairly immune to
noise effects (at least on a local basis), this is the safest
design approach. Frequently this is also the design approach
that needs the fewest engineering resources. The synthesis
and layout steps can be accomplished by automated tools,
with oversight by the designer to ensure that the block
satisfies requirements, timing paths are met, electrical rules
(such as metal electromigration) aren't violated, and so on.

Static design techniques are not ideally suited for large fan-in
and fanout functions. Because of their pullup/pulldown
design, static gates are not the fastest evaluation method for
certain high fan-in/fanout applications. Single-rail dynamic
logic or domino logic is better suited to these applications,
particularly OR functions. A good example of such a function
is the operand dump lines from register files. For an
out-of-order processor with operand data coming from both rename
and architected state registers, the number of drivers on one
bus is quite large. In the case of the PA 8000 there are 56
rename registers and 32 architected state registers on both
the integer and floating-point sides. Trying to drive a single
bus with 88 static drivers is a much more difficult task than
using single-rail dynamic logic. The lower capacitance of
simply using an n-channel FET driver and a bus precharger
for the nondump state helps tremendously in this instance.
Static logic will also consume more area to implement these
types of functions because it requires extra p-channel FET
pullup trees in each block. However, dynamic logic is more
susceptible to noise, requires more careful design attention
than static logic, will in general use more power, and since it
is a clocked mechanism, also increases the clock load. This
type of logic is employed in the data path portions of the
PA 8000.

Single-rail dynamic logic does fail in some instances,
particularly when trying to use the inversion of a value in the
middle of a logic chain, or using an AND function. In this
instance and where static logic is not fast enough, a dual-rail
dynamic logic scheme can be employed. In this type of logic,
both the positive sense and the negative sense of a signal
are derived, both in a low-go-high fashion.* Inversions are
accomplished simply by switching the low-sense and
high-sense signals between gates. This logic can be quite fast
since the design of the gates optimizes one transition edge
and dynamic techniques are employed in the pulldown trees
of the logic gates. In addition, since timing information is
included with the transition of one or the other output sense,
it is a self-timed mechanism. By employing latches that
sense just the first transition of an output pair, this type of
logic can be pipelined and used in multiple stages. Dual-rail
dynamic logic does consume a large amount of area and
power, and therefore was employed only in the most
time-critical portions of the PA 8000, most notably the
floating-point execution units.

* Low-go-high means that the signal starts at the ground voltage and transitions only once
during an evaluate state to the supply voltage VDD.
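
The behavior of dual-rail logic can be illustrated with a small logical model. This is only a sketch: it captures the encoding and the gate functions, not the circuit timing or the precharge clocking, and every name in it is hypothetical. Each signal is carried on two rails that both start low, and exactly one of them rises once the value is known.

```python
from typing import NamedTuple


class DualRail(NamedTuple):
    true_rail: int   # rises when the value is logic 1
    false_rail: int  # rises when the value is logic 0


PRECHARGED = DualRail(0, 0)  # neither rail has fired yet


def encode(value: bool) -> DualRail:
    return DualRail(1, 0) if value else DualRail(0, 1)


def invert(a: DualRail) -> DualRail:
    # Inversion costs no gate: the two rails are simply swapped.
    return DualRail(a.false_rail, a.true_rail)


def and_gate(a: DualRail, b: DualRail) -> DualRail:
    # Both output rails are built from rising inputs only (low-go-high).
    return DualRail(a.true_rail & b.true_rail, a.false_rail | b.false_rail)


def or_gate(a: DualRail, b: DualRail) -> DualRail:
    return DualRail(a.true_rail | b.true_rail, a.false_rail & b.false_rail)


def is_done(a: DualRail) -> bool:
    # Self-timing: the first rail to rise signals that the result is valid.
    return bool(a.true_rail | a.false_rail)


a, b = encode(True), encode(False)
print(and_gate(a, invert(b)))                      # a AND (NOT b) is true
print(is_done(and_gate(PRECHARGED, a)))            # still waiting on an input
print(and_gate(encode(False), PRECHARGED))         # a 0 input resolves AND early
```

Because every gate output is a monotonic function of rising inputs, the AND and the inversion that single-rail dynamic logic cannot provide both become available, at the cost of duplicating each signal.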

Alpha Particle Sensitivity 

The decision to use lead solder bump technology to enable
flip-chip die attach for the PA 8000 presented a new design
challenge for the team. Previous designs were all
wire-bonded dice in ceramic pin-grid array packages (CPGA).
To prevent alpha particles (which are identical to helium
nuclei) emanating from the package or wire bonds from
upsetting sensitive storage nodes within the processor, a
silicon compound is used on the die surface. The flip-chip
attach method, however, places arrays of mostly lead (Pb)
hemispherical bumps over a significant portion of the die
surface. The bump material contains some heavy elements
that are radioactive and the decay of these elements
produces alpha particles and beta and gamma rays that can
cause a single-event upset of a sensitive storage node.

The single-event upset is a high concern in integrated
circuits because a change of state of a storage node can have
serious consequences for executing programs. Any alpha
particle that leaves the solder bump has sufficient mass and
energy to cause an ionized trail of hole-electron pairs that
create mobile charges that can flood a positively charged
storage node and cause an unintended state change of a
memory element. To minimize this undesired event, certain
design changes were adopted for PA 8000 memory circuits.

A SPICE current pulse model that simulated the behavior of
an alpha particle was derived from both empirical
measurements on existing products and simulation using IC process
modeling software. A design rule for the minimum storage
charge (Qcrit) was set and all storage nodes were
designed to meet the new guideline, then verified by SPICE
simulations using the alpha particle current pulse model.
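
A minimum-charge design rule of this kind lends itself to a simple first-pass screen before detailed simulation. The sketch below is an assumption-laden illustration, not the tool or the SPICE flow used on the PA 8000: it approximates the charge on a node as capacitance times voltage, and the node names, capacitances, and the threshold value are all hypothetical.

```python
def stored_charge_fc(capacitance_ff: float, voltage_v: float) -> float:
    """First-order stored charge, Q = C * V (femtofarads * volts = femtocoulombs)."""
    return capacitance_ff * voltage_v


def nodes_below_qcrit(node_caps_ff: dict, qcrit_fc: float, supply_v: float) -> list:
    """Return the names of storage nodes that hold less than the minimum charge."""
    return [
        name
        for name, cap_ff in node_caps_ff.items()
        if stored_charge_fc(cap_ff, supply_v) < qcrit_fc
    ]


# Hypothetical node capacitances in femtofarads.
nodes = {"latch_a": 20.0, "sram_cell": 8.0, "reg_bit": 15.0}

# Hypothetical rule: at least 40 fC on every node at a 3.3 V supply.
print(nodes_below_qcrit(nodes, qcrit_fc=40.0, supply_v=3.3))
```

Nodes flagged by a screen like this would be the candidates for resizing and for closer checking with the current pulse model.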

Clock Distribution Scheme 

In a high-frequency design such as the PA 8000, minimizing
cross-chip clock skew is critical to ensure the maximum
amount of time for logic and data path operations to
complete. Lack of attention to clock distribution for the entire
chip will result in a lower frequency of operation and more
design resources being spent on reducing delays in budgets
that contain cross-chip paths. Excessive clock skew also
increases the likelihood of introducing races into the
design that will need to be identified and fixed. For these
reasons a considerable amount of effort was spent in the
investigation and design of the clock distribution scheme for the
PA 8000.

Also affecting clock skew across the chip is the amount of
load on the global clock signal. With single-rail and dual-rail
dynamic circuitry in the data path sections, the overall clock
load is greater than it would have been had only static
circuitry been used. This places an additional burden on the
clock distribution network because skew increases with
load for a given clock network definition.

Fig. 1. Clock distribution network.

The clock distribution method employed on the PA 8000 is
an H-tree metal structure (see Fig. 1) to deliver the clock
signal from the C4 solder bumps to a first-level on-chip clock
receiver. The output of this receiver is then routed using
matched wire lengths to a second level of clock buffers,
with each buffer carefully positioned on the chip and the
output load of each buffer matched as closely as possible.
Given the large size of the die for the PA 8000 (10.2 by
17.8 mm), process variation will inevitably make the FETs
used in these second-level clock buffers unequal in strength.
The design of these buffers attempted to minimize this
speed variation. A graph of the overall skew using the final
clock distribution scheme is shown in Fig. 2. Using this
design, the overall clock skew across the die was held to
170 picoseconds.

From the second-level clock buffers, careful attention was
paid to the routes of the buffered clock outputs in the next
level of circuitry. To minimize the power dissipation of the
chip and provide nonoverlapping clocks to control blocks,
controlled buffer blocks called clock gaters are employed.
Different types of clock gaters can generate overlapping and
nonoverlapping clocks, and each size of gater is rated for
a specific amount of output load. Checks were performed to
ensure that the proper loading was maintained on all gater
outputs, since the clock outputs for these gater blocks were
guaranteed to a certain specification only under a rated load
range. Whenever possible, the clock gaters were qualified
with control signals to strobe their clock outputs only when
necessary. This allows the clocks for various functional
units to be clocked only when actual work needs to be done,
reducing overall chip power dissipation.

Timing

To ensure high-frequency operation and a short
post-tape-release period, vigorous timing checks were employed by
PA 8000 block and top-level designers. The timing effort on
the PA 8000 was far greater than on previous HP processors,
and was a significant factor in producing first silicon that
ran at the targeted design frequency from the first boot of
the operating system.

Fig. 2. Clock skew topography map.

The size of the die complicated top-level timing analysis
because the sheer distance some signals had to travel added
significant delay to cross-chip budgets. Over-the-block
routing was necessary given the large number of top-level signals
present on the chip. Noise and capacitance to metal layers
inside of the blocks being routed over had to be factored
into the top-level timing analysis.

Repeaters were employed on the PA 8000 for long-route,
timing-critical signals to reduce the delay and allow for faster
signal edges. In some cases this was accomplished with one
noninverting buffer, and in other cases split inverters along
the route were used. Where possible, single inverters were
used in cross-chip paths if this level of inversion could be
absorbed by the receiving or driving logic, thus speeding up
these paths.

Block designers ran timing simulators, both path-driven and
stimulus-driven, to check the internal timing of their blocks
and to verify that their published drive and receive times for
global signals were valid. Close to tape release, a large effort
was put into driving down the number of slow cross-chip
paths, which threatened the frequency goal of the PA 8000.

In addition to the timing checks performed on the PA 8000,
other quality checks were performed to detect potential
problems discovered on previous processors. The checks
will be described in the remainder of this article. Most of
these problems are related to noise events on signals and
supplies that trip sensitive circuitry, causing failures.

Latch Margin Checks 

Latches are an important part of any processor design. A 
large amount of state information about a currently running 
program needs to be stored. Control logic and data paths
both employ latches to a large degree. Latch designs trade 
off setup, hold, and in-to-out delay times by optimizing the 



size of various FETs in the latch structure, particularly the 
feedback inverter, which holds the state of the latch and 
must be overcome to change the state. The PA 8000 design 
employs transparent latches in which the input signal passes 
through a series n-channel FET and thus suffers a gate 
threshold voltage drop as well. 

Since changing the state of a latch inadvertently is potentially 
disastrous, avoiding poor latch designs was a critical design 
goal. For this reason, a specific tool was developed to ana- 
lyze the electrical margins of a latch and was run on all the 
latches on the PA 8000. The complexity of this tool grew 
from a desire to be able to check both full and half latches. 
A full latch consists of two cross-coupled inverters while a 
half latch has a single FET connected to the inverter output 
(see Fig. 3). 

The latch check program evaluated the set drive path to
determine if it was strong enough to overcome the feedback
FETs. Since the input drive signal must be known to accomplish
this evaluation, and extracting this drive signal from all
of the places where latches are used is a rather complex
task, the program had to make some assumptions about the
driving block when run only on the latch cell. For critical
paths or latches with particularly small margins, the actual
driving path was placed into a small schematic and the
program was run on this schematic to ensure that the latch
was acceptable.

Signal Noise Checks 

In implementing the PA 8000, additional levels of inter- 
connect were required with finer geometries than had been 
used on past designs to connect the blocks on the chip 
together. This posed a number of problems in guaranteeing 
that the design would be electrically robust at the high fre- 
quencies at which the PA 8000 operates. Experience during 
electrical characterization of previous designs indicated that 
internal signal integrity would be a serious issue for the 
PA 8000.

Fig. 3. Two types of latches. (a) Half latch. (b) Full latch.

Signal Integrity Issues in Advanced Processes 

Three major problems arise with interconnect as processes
continue their inexorable march toward smaller dimensions
and higher frequencies:

• Signal crosstalk is very significant at the 0.5-μm process
generation and beyond.

• Signal rise and fall times decrease as transistor speed
increases.

• Signal coupling increases because smaller dimensions are
used for interconnect. The smaller dimensions especially
increase coupling between metal lines on the same
interconnect layer.

Signal crosstalk (noise effects) includes both capacitive and
inductive components. In the equations i = C dv/dt and
v = L di/dt, all of the factors (C, L, dv/dt, and di/dt) are
increasing with decreasing interconnect dimensions and faster
transistors. This leads to voltage and current disturbances in
lines that couple to adjacent metal lines through mutual
capacitive and inductive effects. An example of an interconnect
and circuit topology that can cause these problems is
shown in Fig. 4.
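
As a rough worked example of the two relations above, the following
Python sketch evaluates i = C dv/dt and v = L di/dt for a single
coupled line. Every component value here is an illustrative
assumption chosen for the arithmetic, not PA 8000 data.

```python
def coupled_current(c_farads, dv_volts, dt_seconds):
    """Current injected through a coupling capacitance: i = C * dv/dt."""
    return c_farads * dv_volts / dt_seconds


def induced_voltage(l_henries, di_amps, dt_seconds):
    """Voltage induced across an inductance: v = L * di/dt."""
    return l_henries * di_amps / dt_seconds


# Assumed values, for illustration only.
C_COUPLE = 100e-15   # 100 fF of coupling capacitance
L_MUTUAL = 1e-9      # 1 nH of inductance
EDGE_TIME = 200e-12  # 200 ps edge
SWING = 3.3          # 3.3 V signal swing

i_noise = coupled_current(C_COUPLE, SWING, EDGE_TIME)
v_noise = induced_voltage(L_MUTUAL, i_noise, EDGE_TIME)
print(f"i = {i_noise:.3e} A, v = {v_noise:.3e} V")
```

Halving the edge time doubles the injected current, which is why
faster transistors make the problem worse even when the geometry is
unchanged.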

Very fast edge rates require high transient currents (tens of
amperes) from the off-chip and on-chip power networks.
High currents are also present in the main clock network on
the chip. Power supply networks require careful design to
minimize inductive and capacitive effects on voltage levels.
Clock nets also need to maintain good voltage levels as well
as minimize clock skew delays between various blocks.

Solving Signal Integrity Problems 

Different approaches can be used to solve signal integrity
problems. In general, combinations of the following
techniques were used on the PA 8000:

• Adjust spacing of signals relative to each other.
• Include shields above and below signals.
• Include restoring logic (repeaters) in the route.
• Design signal receivers that reject noise events.

A key component of the effort to correct signal integrity
problems is a toolset that can be used to identify them in
the first place. This toolset needs the ability to do RC
extraction and the ability to identify circuit topologies that may
be susceptible to noise problems. RC extraction allows
determination of the extent of possible coupling problems.
By combining it with identification of susceptible circuits,
solutions to problems can be implemented.

To identify circuits with noise susceptibility, an existing
internal tool was heavily modified and extended to allow
easy traversal of the current schematic or artwork netlist
hierarchy. This tool could display all connections of a given
signal down to the transistor level, including information on
FET sizes and estimates of capacitive loading (from
schematics) or extracted capacitive load (from artwork).
Information on port directionality and other text properties added
by the block designer could also be displayed, as well as
what terminals of a FET are connected to the signals. One
additional important feature of the tool was that it could
track any changes in real time, as soon as they were made
by designers. This tool was used for many purposes by
designers in addition to its use in noise checks.

The latching methodology used on the PA 8000 has a potential
failure mode: excursions of a signal beyond a supply rail
(e.g., below local ground for a given latch) could cause the
latch to lose its value. An example of this is illustrated in
Fig. 5. The latch shown is holding a high value (node INI is
at VDD, held by the weak feedback inverter). If the victim line
is at 0V and the culprit lines are at VDD and transition to 0V
quickly, an excursion of the input signal below local ground
is possible, induced by capacitive coupling from the culprit
lines to the victim line as the culprit lines transition from
1 to 0. This input signal excursion can cause the n-channel
FET pass gate that serves as the input to the latch to turn on
even though its gate is held at 0V (VGS for the transistor is
greater than VT). This is because the victim input is
temporarily below local ground. With this n-channel FET pass gate
on, the latch can spuriously dump the value it was holding
by discharging the INI node if the transient is enough to
overcome the feedback inverter and trip the forward inverter.
This type of failure may change the state of the chip, and is
a serious problem that must be avoided.
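
The turn-on condition in this failure mode can be written as a
one-line check. The sketch below is only an illustration of the
VGS > VT condition described above. The threshold value and the
function name are assumptions, and a real analysis would also have
to ask whether the transient lasts long enough to overcome the
feedback inverter.

```python
def pass_gate_conducts(v_gate, v_input, v_threshold):
    """Return True when an n-channel pass FET is turned on.

    The input (diffusion) terminal acts as the source, so
    VGS = v_gate - v_input. The FET conducts when VGS exceeds VT.
    """
    return (v_gate - v_input) > v_threshold


VT = 0.6  # assumed threshold voltage in volts, for illustration only

# Gate held at 0V, input sitting at local ground: the pass gate is off.
quiet = pass_gate_conducts(0.0, 0.0, VT)

# Gate still at 0V, but coupling drags the input 0.8V below ground.
disturbed = pass_gate_conducts(0.0, -0.8, VT)

print(quiet, disturbed)
```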

Other possible problem circuits were also identified by this
tool, including heavily ratioed* combinations of p-channel
FETs and n-channel FETs and long routes connected to gate
inputs of pass FET latches. However, diffusion-connected
inputs were the most common problems. To identify
diffusion-connected inputs, the netlist traversal tool was run on
every top-level signal in the design. The tool identified
top-level signals connected to the source or drain of a pass FET
in a latch. This gave a textual report of all connections down
to the FET level for every top-level signal, in addition to the
FET terminal connections and whether the signal was an
input or output of that particular leaf cell.

Once the report was generated, a parser analyzed the
connectivity to determine if any signal connected to a FET diffusion
was also an input (outputs of leaf cells were ignored). Other
checks were performed for additional suspect circuit
topologies. When potential problem signals were identified, the
information was integrated with RC extraction results to
determine priorities for fixing signals, and the results were
distributed to designers to give them feedback on which
signals in their blocks needed to be fixed. Extensive
simulations showed that only routes longer than a specified
threshold length would need to be fixed. This threshold
gave designers a limit at which they would have to do
something to reduce susceptibility to noise on a signal being
received by their block.

* Heavily ratioed combinations are combinations of inverters and other FETs in which the
effective p-channel FET drive strength is significantly different from the effective n-channel
FET drive strength.
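
The flow described above (report, parser, then prioritization
against a length threshold) might be sketched as follows. This is a
minimal illustration, not the internal HP tool: the record layout,
the function names, and the use of route length as the ranking key
are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Connection:
    signal: str        # top-level signal name
    terminal: str      # FET terminal: "gate", "source", or "drain"
    is_pass_fet: bool  # True if the FET is a latch pass gate
    direction: str     # "input" or "output" of the leaf cell


def diffusion_connected_inputs(connections):
    """Flag signals that enter a leaf cell on a pass FET source or drain."""
    flagged = set()
    for c in connections:
        on_diffusion = c.terminal in ("source", "drain")
        if c.is_pass_fet and on_diffusion and c.direction == "input":
            flagged.add(c.signal)  # outputs of leaf cells are ignored
    return flagged


def signals_to_fix(flagged, route_length_um, threshold_um):
    """Keep flagged signals whose route exceeds the threshold, longest first."""
    long_routes = [s for s in flagged if route_length_um.get(s, 0.0) > threshold_um]
    return sorted(long_routes, key=lambda s: route_length_um[s], reverse=True)


report = [
    Connection("a", "source", True, "input"),
    Connection("b", "gate", True, "input"),
    Connection("c", "drain", True, "output"),
    Connection("d", "drain", True, "input"),
]
lengths = {"a": 1200.0, "d": 3000.0}  # hypothetical extracted lengths

flagged = diffusion_connected_inputs(report)
print(signals_to_fix(flagged, lengths, threshold_um=2000.0))
```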

In most cases, designers used one of the techniques
described above to alleviate these noise problems. The most
popular solution inserted a restoring inverter in front of the
pass gate and modified the latch slightly (to make it logically
equivalent to the latch that needed to be replaced) as shown
in Fig. 6. The restoring inverter in front of the pass FET
makes the latch far more immune to noise events on the
input. At other times, repeaters (inverters and buffers) were
inserted in routes to cut down the distance of the route, thus
reducing the susceptibility of a given line to transitions by its
neighbors.

Signal Integrity Results

Overall, the techniques described above were effective in
eliminating noise-induced electrical failures in the PA 8000
design, and probably saved several months of characterization
to investigate noise failures that would have existed had
this tool not been developed. Over 71 M to potential problems
were flagged with the first run of the tool. All of these
problems were investigated and either fixed or waivered before
tape release. The PA 8000 was a very electrically robust
design given its complexity level when silicon was received.

One drawback of this tool was that it was only run on
top-level signals. Since some of the blocks on the PA 8000 were
very large, long routes and therefore noise problems could
also be embedded inside blocks. One such problem was
found during characterization of the chip at the block level.
We are currently extending the noise analysis tools to operate
at deeper levels throughout the chip hierarchy to thoroughly
check all signals on the chip. RC extraction is being
extended to allow deeper levels of extraction without long
run times, and inclusion of inductive effects is also being
investigated.

A limitation of this type of tool is that it can generate a lot
of noise, that is, report problems that really aren't problems.
This affects designer productivity because the problems
reported by the tool must be investigated. However, the
penalty and cost for finding a noise problem in a design can
be very high, especially late in the characterization process,
so effort spent early to eliminate possible problems is very
worthwhile. We are currently developing more advanced
tools to eliminate some of this noise and make sure that only
problems serious enough to warrant fixing are included.

Block Quality Checks 

Block design, especially for complex blocks, is a
time-consuming process in which, despite the best intentions
of the designer, problems can sneak through without being
noticed. For this reason several additional tools were
developed to allow designers to check for potentially troublesome
circuits in their blocks.



One tool checks for so-called "ugly" polysilicon structures.
Given the resistance of the polysilicon layer in the HP
process used to fabricate the PA 8000, long polysilicon routes
are undesirable and can cause numerous problems, chief
among these being slow speed. Standard-cell routed blocks
suffered less from this problem because the routers
employed used only metal layers for signal interconnect. Long
polysilicon problems occurred primarily in semicustom and
full-custom designs. This tool flagged polysilicon routes
between 25 and 50 micrometers long as warnings and over
50 micrometers as errors.
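
The two length limits translate directly into a small classifier.
The sketch below assumes that both ends of the warning range are
inclusive, which the text does not state, and the function name is
hypothetical.

```python
WARNING_MIN_UM = 25.0  # lower bound of the warning range, from the text
ERROR_MIN_UM = 50.0    # routes longer than this are errors, from the text


def classify_poly_route(length_um):
    """Classify a polysilicon route by its length in micrometers."""
    if length_um > ERROR_MIN_UM:
        return "error"
    if length_um >= WARNING_MIN_UM:  # inclusive bounds are an assumption
        return "warning"
    return "ok"


for length in (10.0, 30.0, 75.0):
    print(length, classify_poly_route(length))
```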

With the significant use of clock gaters to create many
different flavors of clocks, both overlapping and
nonoverlapping, races were expected to be more prevalent in the
PA 8000 design. Pass-gate blocks in particular cause these
types of problems. Clock-qualified signals (signals derived
from clock edges) driving other clock-qualified nodes were
checked to cover signal races not detectable by the previous
race checking methodology used in PA-RISC processor
designs.

Summary 

All of the techniques described above helped to make the
PA 8000 processor a successful project, achieving its
frequency, performance, and aggressive post-tape-release
schedule. This was a great achievement given the sheer
complexity of the design, the fact that it was a new processor
architecture, and the number of new technologies
employed in the design. This success is due in large part to
the design methodologies used for this processor, particularly
the new methodologies developed for the PA 8000 design.

Functional Verification of the HP
PA 8000 Processor 



The advanced microarchitecture of the HP PA 8000 CPU has many features 
that presented significant new verification challenges. These include 
out-of-order instruction execution, register renaming, speculative 
execution, four-way superscalar operation, decoupled instruction 
fetch, concurrent system bus interface, and PA-RISC 2.0 architecture 
enhancements. Enhanced functional verification tools and processes were 
required to address this microarchitectural complexity. 

by Steven T. Mangelsdorf, Raymond P. Gratias, Richard M. Blumberg,
and Rohit Bhatia 



Computer system performance has been improving recently
at a rate of 10 to 60 percent per year. This growth rate has
been fueled by several factors. Advancements in integrated
circuit technology have made higher microprocessor clock
rates and larger caches possible. There have been
contributions from system software as well, such as compilers that
emit more efficient machine code to realize a given function.
The PA-RISC instruction set architecture has evolved to keep
pace with changes in technology and customer workloads.

These factors alone, however, would not have been sufficient
to satisfy customer demand for increased performance in a
very competitive industry. The balance has been made up by
innovations in microarchitecture that increase the amount of
useful work that a microprocessor performs in a clock cycle.
This has increased the complexity of the design and thus the
effort required for successful functional verification.

Many of our previous microprocessor projects have reused
existing cores (although generally with significant
modifications and enhancements). In contrast, the HP PA 8000 CPU
has a new microarchitecture that borrows little from
previous projects. Some of the features in its microarchitecture
presented significant new verification challenges:

• Out-of-Order Execution. A 56-entry queue of pending
instructions is maintained by an instruction reorder buffer (IRB).
The queue hardware selects instructions for execution that
have their operands available irrespective of program order.

• Register Renaming. Write-after-write and write-after-read
ordering dependencies are eliminated by remapping
references from an architectured register to a temporary register.

• Speculative Execution. The PA 8000 predicts whether a
branch is taken and can tentatively execute instructions
down the predicted path. The side effects of all such
instructions must be canceled if the prediction turns out
to be incorrect.

• Four-Way Superscalar Operation. The PA 8000 has ten
functional units and can sustain an execution rate of four
instructions per cycle.

• Decoupled Instruction Fetch. Instructions are fetched and
inserted into the queue by an autonomous instruction fetch
unit (IFU). The IFU performs branch prediction and caches
the target addresses of recently taken branches in a branch
target address cache (BTAC).

• Concurrent System Bus Interface. Memory requests can be
issued out of order, and data returns can be accommodated
out of order. Up to 10 requests can be outstanding at a time.

• PA-RISC 2.0 Architecture Enhancements. These provided
important new capabilities, such as 64-bit addressing and
computation, but they necessitated tool rework and limited
reuse of existing test cases.

This paper describes the enhanced functional verification
tools and processes that were required to address the
daunting microarchitectural complexity of the PA 8000.

Verification Overview 

The purpose of functional verification is to identify defects 
in the design of a microprocessor that cause its behavior to 
deviate from what is permitted by the specification. The 
specification is the PA-RISC instruction set architecture 
and the bus protocols established by industry standards or 
negotiated with the designers of other system components. 
Performance specifications, such as instruction scheduling 
guidelines committed to the compiler developers, may also
be considered. 

Although it is not possible to prove the correctness of a
microprocessor design absolutely through exhaustive
simulation or existing formal verification techniques, a functional
verification effort must achieve two things to be considered
successful. First and foremost, it must provide high
confidence that our products will meet the quality expectations
of our customers. At the same time, it must identify defects
early enough in the design cycle to avoid impacting the
product's time to market.

A typical defect caught early in the design cycle might cost
only one engineering day to debug and correct in the RTL
(Register Transfer Language). Close to tape release, it might
take five to ten days to modify transistor-level schematics
and layout, modify interblock routing, and repeat timing
analysis. Therefore, tape release can be delayed if the defect
rate is not driven down quickly.

After tape release, lost calendar time is the primary cost of
defects because the time required to fabricate a new revision
of the design is at best a few weeks and at worst a few
months. Defects that are so severe that they block a
software partner's development, tuning, or testing efforts can
put them on the critical schedule path of the product. The
worst-case scenario is a masking defect that blocks further
testing efforts for a certain functional area of the design, and
this delays the discovery of additional defects by the time
required to fabricate a new revision. One or more masking
defects in series can quickly devastate the product schedule.

The PA 8000 verification effort consisted of a presilicon phase
and a postsilicon phase. The purpose of the presilicon phase
was to find defects concurrently with the design, when the
cost of correcting them was small, and to drive up the quality
level at first tape release so that the first prototypes would
be useful to our software partners. This was done using
three tactics: RTL simulation, accelerated simulation, and
switch-level simulation. The postsilicon effort consisted of
aggressive characterization of hardware prototypes to
complete verification before systems were shipped to customers.
Also, performance verification was done at various stages in
the project.

RTL Simulation 

Most previous PA-RISC microprocessor projects have built
their functional verification efforts around an internally
developed RTL simulator that compiles RTL descriptions
of blocks into C code, which is then compiled with HP's
C compiler. Block execution is scheduled dynamically using
an event-driven algorithm. This simulation technology
achieves modest performance (about 0.5 Hz running on a
typical workstation), but it does provide capabilities for rapid
prototyping such as the ability to simulate very high-level
RTL and quick model builds. Therefore, our RTL simulator
became the cornerstone of our verification effort early in
the design.

Fig. 1 shows the verification environment used for RTL
simulation. There are four basic components in the environment:

• The RTL model for the PA 8000.

• Bus emulators, which can apply interesting stimulus to the
input buses of the PA 8000, including responses to its
transactions. We included emulators for all components sharing
the system bus, including the memory system, I/O adapter,
and third-party processors.

• Checking software, which monitors the behavior of the
PA 8000 and verifies that it complies with the specifications.
This also helps speed debugging by flagging behavioral
violations as soon as they occur.

• A variety of test case sources and tools that can compile
the test cases into an initial state for the PA 8000 model and
configure the bus emulators.

Checking Software 

The most important check is a thorough comparison between
instructions retiring in the PA 8000 model and instructions
retiring in the PA-RISC architectural simulator. Retiring
means exiting the instruction reorder buffer, or IRB (see
article, page 8). A tool called the depiper captures
information about each instruction retiring in the PA 8000 model,
including what resources (such as destination registers)
are being modified and the new values. The synchronizer
compares this with similar information obtained from the
PA-RISC architectural simulator, which is also running the
same test case. This provides very high confidence that the
PA 8000 complies with the basic PA-RISC instruction set
architecture. A final-state comparison of all processor and
memory state information is also done at the end of each
test case.

The depiper also provides the synchronizer with information
about architecturally transparent events such as cache
misses. Using this information, the synchronizer can perform
strong checks in the areas of cache coherency, memory
access ordering consistency, and memory-to-cache transfers.
In addition, a number of checkers were developed for other
areas:

• A checker for the instruction queues, including whether
the order in which instructions are sent to functional units
complies with data dependencies

• A checker for protocol violations on the system bus

• A checker for the bus interface block, discussed in more
detail below

• A checker that detects unknown (X) values on internal
nodes.
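
The core retire-stream comparison can be pictured with a short
sketch. This is a minimal illustration under stated assumptions, not
the actual depiper or synchronizer: the record fields and the
function name are hypothetical, and only destination writes are
compared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetireRecord:
    pc: int        # address of the retiring instruction
    writes: tuple  # ((resource, new_value), ...) in a fixed order


def first_mismatch(model_retires, reference_retires):
    """Compare two retire streams in program order.

    Returns the index of the first differing record, or None if the
    streams agree. A length difference counts as a mismatch at the
    end of the shorter stream.
    """
    for idx, (m, r) in enumerate(zip(model_retires, reference_retires)):
        if m != r:
            return idx
    if len(model_retires) != len(reference_retires):
        return min(len(model_retires), len(reference_retires))
    return None


reference = [
    RetireRecord(0x1000, (("r3", 7),)),
    RetireRecord(0x1004, (("r4", 9),)),
]
good_model = list(reference)
bad_model = [reference[0], RetireRecord(0x1004, (("r4", 8),))]

print(first_mismatch(good_model, reference), first_mismatch(bad_model, reference))
```

Reporting the first mismatch, rather than only a final-state
difference, is what lets a violation be flagged close to the cycle
in which it occurs.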

Test Case Sources 

A test case is essentially a test program to be run through the
RTL model of the processor to stress a particular area of
functionality. These are generally written in a format similar
to PA-RISC assembly language, with annotations to help
specify initial cache and TLB contents. In addition, a control
file can be attached to a test case to specify the behavior
of the bus emulators. The emulators have useful default
behavior, but if desired the control files can precisely
control transaction timing.

A test case is compiled using a collection of tools that
includes the PA-RISC assembler. The result of the compilation
is a set of state initializations for the RTL model. These
include the processor registers, caches, TLB, and memory.
In addition, the bus emulators are initialized with the
commands they will use during execution of the test case.

Previous PA-RISC microprocessor projects had built up a
library of test cases and architectural verification programs
(AVPs). Although we did run these, it was clear from the
beginning that a large source of new cases would be
required. The existing cases were very short, so their ability
to provide even accidental coverage for a machine with a
56-entry IRB was questionable. Moreover, we needed cases
that targeted the unique microarchitectural features of the
PA 8000.

We developed a test case template expander to improve
our productivity in generating the large number of cases
required. An engineer could write a test template specifying
a fundamental interaction, and the tool would expand this
into a family of test cases. Some of the features of this tool
included:

• The ability to sweep a parameter value. This was often used
to vary the distance between two interacting instructions.

• The ability to fill in an unspecified parameter with a random
value.

• An if construct, so that a choice between two alternatives
could be conditional on parameters already chosen.

• Instruction groups, so that an instruction could be specified
that had certain characteristics without specifying the exact
instruction.
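
The four features listed above can be illustrated with a small
sketch. It is not the HP tool: the function names, the template
written as a Python callable, the instruction groups, and the
operand syntax of the emitted lines are all assumptions made for
the example.

```python
import itertools
import random

# Hypothetical instruction groups: pick any member with the named property.
GROUPS = {"alu": ["ADD", "SUB", "OR"], "load": ["LDW", "LDD"]}


def expand_template(build, sweeps, random_params, groups, seed=0):
    """Expand one template into a family of test cases.

    sweeps:        {name: [values]} swept exhaustively
    random_params: {name: (low, high)} filled with a random integer
    """
    rng = random.Random(seed)

    def pick(group):
        return rng.choice(groups[group])

    names = sorted(sweeps)
    cases = []
    for values in itertools.product(*(sweeps[n] for n in names)):
        params = dict(zip(names, values))
        for name, (low, high) in sorted(random_params.items()):
            params[name] = rng.randint(low, high)
        cases.append(build(params, pick))
    return cases


def dependent_pair(params, pick):
    """Template: a producer, N filler instructions, then a consumer."""
    lines = [f"{pick('alu')} r1, r2, r3"]
    lines += ["NOP"] * params["distance"]
    if params["use_load"]:  # choice conditional on an earlier parameter
        lines.append(f"{pick('load')} 0(r3), r4")
    else:
        lines.append(f"{pick('alu')} r3, r5, r6")
    return lines


cases = expand_template(
    dependent_pair, {"distance": [0, 1, 2]}, {"use_load": (0, 1)}, GROUPS
)
for case in cases:
    print(case)
```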

We also used the pseudorandom code generator and test
coverage measurement techniques discussed below in the
RTL simulation environment. To improve our coverage of
multiprocessor functionality, we configured our bus
emulators to generate random (but interacting) bus traffic.

Structural Verification 

A block can be described by a single large RTL procedure
or by a schematic that shows the interconnection of several
smaller blocks, each of which is described by RTL. At the
beginning of the project, RTL tends to be written at a high
level because it can simulate faster and is easier to write,
debug, and maintain when the design is evolving rapidly.
Block designers, however, have a need to create schematics
for their blocks, so there is a risk that these will diverge
from the RTL reference.

We considered three strategies to verify that the two
representations of the block were equivalent. The first, formal
verification, was not pursued because the required tools
were not yet available from external vendors. The second
strategy was to rely on the switch-level verification effort.
This was unattractive because defects would be found too
late in the design cycle, and the planned number of vectors
to be run might not have provided enough coverage. The
strategy selected was to retire the higher-level RTL
description and replace it in the RTL model with the lower-level
representation. The more timely and thorough verification
that this provided compensated for some disadvantages,
including slower simulation and more difficulty in making
changes. We also used this strategy selectively, relying on
switch-level simulation to cover regular blocks such as data
paths with little risk.

Divide and Conquer 

In any large design effort, one faces a choice of whether to
verify components individually, together, or both. Verifying
a component separately has several potential advantages.
Simulation time is greatly reduced. Input buses can be
directly controlled, so effort need not be expended
manipulating the larger model to provide interesting stimulus. Finally,
dependencies between subprojects are eliminated.

For separate verification to succeed, the interfaces to other
components must be very well-specified and clearly
documented. Investments must be made in a test jig to provide
stimulus to the component and in checking software to
verify its outputs. In addition, some portion of the
verification must be repeated with all components integrated to
guard against errors in the specifications or different
interpretations of them.

The PA 8000's bus interface block was particularly
well-suited to separate verification. The block had clean external
interfaces but contained a lot of complexity, including the
hardware to manage multiple pending memory accesses.
A software checking tool was written to monitor the block's
interfaces and verify its operation. Checking that a request
on one bus ultimately results in a transaction on the other
bus is a simple example of numerous checks performed by
this tool. A very low defect rate demonstrated the success
of the divide-and-conquer strategy for this block.
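
The simple check mentioned above, that every request on one bus
ultimately produces a transaction on the other, might look like the
following sketch. The event format, the tag-based matching, and the
function name are assumptions for illustration and do not describe
the actual checking tool.

```python
def check_bus_interface(events):
    """Match requests on one bus against transactions on the other.

    events: iterable of (cycle, kind, tag), where kind is "request"
    or "transaction". Returns a list of violation messages.
    """
    pending = {}  # tag -> cycle in which the request was seen
    errors = []
    for cycle, kind, tag in events:
        if kind == "request":
            pending[tag] = cycle
        elif kind == "transaction":
            if tag in pending:
                del pending[tag]  # completion may arrive out of order
            else:
                errors.append(f"cycle {cycle}: transaction {tag} has no request")
    for tag, cycle in sorted(pending.items()):
        errors.append(f"request {tag} from cycle {cycle} never reached the other bus")
    return errors


clean = [(1, "request", 1), (2, "request", 2),
         (5, "transaction", 2), (6, "transaction", 1)]
lost = [(1, "request", 1), (2, "request", 2), (5, "transaction", 2)]

print(check_bus_interface(clean))
print(check_bus_interface(lost))
```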

Most of our remaining verification effort was focused on the
complete PA 8000. As a final check, a system-level RTL model
was built that included several processors, the memory
controller, the I/O adapter, and other components. Although
throughput was very low, basic interactions between the
components were verified using this model.

Accelerated Simulation 

The speed of the RTL simulator was adequate to provide
quick feedback on changes and for basic regression testing,
but we lacked confidence that on a design as complex as the
PA 8000 it would be sufficient to deliver an adequate quality
level. We saw a strong need for a simulation capability that
was several orders of magnitude faster so that we could
run enough test cases to ferret out more subtle defects. We
considered two technologies to provide this: cycle-based
simulation and in-circuit emulation.

Cycle-based simulation provides a much faster software
simulation of the design. With an event-driven simulator
such as our RTL simulator, a signal transition causes all
blocks that the signal drives to be reexecuted, and any
transitions on the outputs of these blocks are similarly
propagated until all signals are stable. The overhead to process
every signal transition, or event, is fairly high. Cycle-based
simulators greatly improve performance by eliminating this
overhead. The design is compiled into a long sequence of
Boolean operations on signal values (AND, OR, etc.), and
execution of this sequence simulates the operation of the
logic in a clock cycle. The name cycle-based simulator
comes from the fact that the signal state is only computed at
the ends of clock cycles, with no attempt to simulate
intermediate timing information. Our investigation revealed that
speedups of 500 times were possible, so a simulation farm
of 100 machines could have a throughput on the order of
25,000 Hz. The biggest drawback of this strategy was that
cycle-based simulators were not yet available from external
vendors.
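
A toy version of the idea is sketched below: the logic is flattened
into a fixed, ordered list of Boolean operations that is executed
once per clock cycle, with no event queue. The two-bit counter used
as the design, the tuple encoding, and all names are assumptions
for illustration. The last lines simply redo the throughput
arithmetic, reading the article's figures as about 0.5 Hz for the
RTL simulator, a 500-times speedup, and 100 machines.

```python
# Each step is (destination, operation, source_a, source_b).
PROGRAM = [
    ("n0", "NOT", "q0", None),  # next q0 = NOT q0
    ("n1", "XOR", "q1", "q0"),  # next q1 = q1 XOR q0
]

OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "NOT": lambda a, _b: a ^ 1,
}


def step(state):
    """Advance one clock cycle by running the whole operation sequence."""
    values = dict(state)
    for dest, op, a, b in PROGRAM:
        values[dest] = OPS[op](values[a], values[b] if b else 0)
    return {"q0": values["n0"], "q1": values["n1"]}


state = {"q0": 0, "q1": 0}
counts = []
for _ in range(4):
    state = step(state)
    counts.append(2 * state["q1"] + state["q0"])
print(counts)

RTL_HZ = 0.5    # approximate RTL simulator speed on one workstation
SPEEDUP = 500   # speedup estimated for cycle-based simulation
MACHINES = 100  # size of the simulation farm
farm_hz = RTL_HZ * SPEEDUP * MACHINES
print(farm_hz)
```

Because the operation order is fixed at compile time, the cost per
cycle is a straight run through the list regardless of how many
signals actually change.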

With in-circuit emulation, the gates in a Boolean
representation of the design are mapped onto a reconfigurable array
of field-programmable gate arrays (FPGAs). The design is
essentially built using FPGAs, and the emulated processor
is connected to the processor socket in an actual system.
The clock rate of the emulation system is on the order of
300,000 Hz, so very high test throughput is possible. It is
even possible to boot the operating system. Unfortunately,
there were many issues involved in using in-circuit
emulation successfully:

• Custom printed circuit boards would have to be designed
for the caches, large register files, and any other regular
structures that consume too much emulation capacity.
Changes in the design would be difficult to accommodate.

• A system was needed to exercise the emulated processor,
including a memory controller and I/O devices. Firmware
and hardware tinkering would have been needed to make
this system functional at the slow clock rates required by
the emulation system.

• Productivity was reduced by long compile times and limited
observability of internal signals. Only one engineer at a time
could use the system for debugging.

• The strategy was difficult to extend to multiprocessor
testing. It was prohibitively expensive to emulate multiple
processors. We planned to use a software emulator to
create third-party bus traffic and verify the processors'
responses, but there was a risk that the software's
performance would throttle the emulation system's clock rate.

• The emulation system was a very large capital investment.

We were quite wary of in-circuit emulation since its use on
a previous project had failed to make a significant
contribution to functional verification. We were also willing to give
up the performance advantage of in-circuit emulation to
avoid tackling the ease-of-use issues. The decision to use
cycle-based simulation would have been simple except that
it meant that we would have to develop the simulator
ourselves. R&D organizations in HP are challenged to focus on
areas of core competency and look to external vendors to
fulfill needs such as design tools that are common in the
industry. We did select cycle-based simulation because we
were confident that its lower risk and higher productivity
would translate into a competitive advantage.

We were careful to reuse components wherever possible
and to limit the scope of the project to providing the tool
functionality required to verify the PA 8000. We did not
attempt to create a simulation product useful to other
groups within HP. This turned out to be a good decision
because comparable tools have recently started to become
available from external vendors.

Cycle-Based Simulation Compiler 

The cycle-based simulation compiler operates only on simple
gate-level primitives such as logic gates and latches, so
higher-level RTL must first be synthesized into a gate-level
equivalent. We had to develop our own translator for this
because the RTL language used by our RTL simulator
was defined before the industry standardization of such
languages. Another simplification is that signal values are
limited to 0 and 1, with no attempt to model an unknown (X)
state.

Fig. 2 shows a simple example circuit, a two-bit counter,
that we will use to illustrate the compilation process. The
user must describe to the compiler information about the
circuit's clocks. The clock cycle is broken down into two or
more phases, with the state of the clocks fixed during each
phase. This circuit has a clock cycle of two phases, and the
clock (CLK) is low during the first phase and high during the
second phase.

The compiler uses this information to determine which
gates need to be evaluated during each phase. This is done
in two steps. First, for each phase, the compiler propagates
the clock values into the circuit. This uses simple rules of
Boolean logic, such as the fact that the output of an AND gate
with a zero input must be zero. The goal is to identify latches
with a zero control, which are therefore provably opaque
during that phase. Next, again for each phase, the compiler
finds all gates that can be reached from a clock or other
input through a path that does not contain an opaque latch.
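
The two analysis steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the compiler described here: the netlist format, the node names, and the functions propagate_constants and gates_for_phase are hypothetical, and the circuit is a simple master-slave toggle rather than the two-bit counter of Fig. 2.

```python
# Hypothetical netlist (not the Fig. 2 counter): output node -> (type, inputs).
# A LATCH lists (data, control) and is transparent while its control is 1.
NETLIST = {
    "nclk": ("NOT", ["CLK"]),
    "m": ("LATCH", ["d", "nclk"]),  # master latch, transparent while CLK = 0
    "q": ("LATCH", ["m", "CLK"]),   # slave latch, transparent while CLK = 1
    "d": ("NOT", ["q"]),            # feedback that toggles the stored bit
}
PRIMARY_INPUTS = ["CLK"]


def propagate_constants(netlist, known):
    """Step 1: push known 0/1 clock values forward with simple Boolean rules."""
    known = dict(known)
    changed = True
    while changed:
        changed = False
        for out, (kind, ins) in netlist.items():
            if out in known or kind == "LATCH":
                continue
            vals = [known.get(name) for name in ins]
            value = None
            if kind == "NOT" and vals[0] is not None:
                value = 1 - vals[0]
            elif kind == "AND":
                if 0 in vals:                      # any zero input forces zero
                    value = 0
                elif all(v == 1 for v in vals):
                    value = 1
            elif kind == "OR":
                if 1 in vals:                      # any one input forces one
                    value = 1
                elif all(v == 0 for v in vals):
                    value = 0
            if value is not None:
                known[out] = value
                changed = True
    return known


def gates_for_phase(netlist, inputs, clock_values):
    """Return (opaque latches, gates to evaluate) for one clock phase."""
    known = propagate_constants(netlist, clock_values)
    opaque = {
        out for out, (kind, ins) in netlist.items()
        if kind == "LATCH" and known.get(ins[1]) == 0
    }
    # Step 2: walk forward from the inputs, never passing an opaque latch.
    reached, frontier = set(), list(inputs)
    while frontier:
        node = frontier.pop()
        for out, (_kind, ins) in netlist.items():
            if node in ins and out not in reached and out not in opaque:
                reached.add(out)
                frontier.append(out)
    return opaque, reached


for phase, clk in enumerate([0, 1], start=1):
    opaque, gates = gates_for_phase(NETLIST, PRIMARY_INPUTS, {"CLK": clk})
    print(phase, sorted(opaque), sorted(gates))
```

In this toy circuit the slave latch is opaque while CLK is low, so only the inverter and the master latch need evaluation in the first phase. In the second phase the master latch is opaque, and the inverter, the slave latch, and the feedback gate are evaluated.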

Next, a sequence of Boolean operations is emitted
corresponding to the gates in each phase. Because we used
PA-RISC machines for simulation, the sequences were actually
output in PA-RISC assembly language. The sequences
totaled more than two million instructions for the PA 8000
design. The gates are ordered in sequence so that a gate is
not emitted until its inputs have been computed. Cycles, or
loops, in the circuit are handled by looping through the gates
in the cycle until all circuit nodes are stable.

Fig. 2. Cycle-based simulation compilation example.

Numerous optimizations are done on the output assembly 
language sequences: 

• Clock signals have known values during each phase, which
can be propagated into the circuit. These constant values
can simplify or eliminate some of the Boolean operations.

• The 32 PA-RISC registers are used to minimize loads and
stores to memory. Boolean operation scheduling and victim
register selection are employed to minimize the number of
loads and stores.

• The compiler can determine which circuit nodes carry
information from one phase to the next. The remaining nodes are
temporaries whose values need not be flushed to memory
after their final use within a phase.

• To eliminate NOT operations corresponding to inverting
gates, the compiler can represent nodes in inverted form
and perform DeMorgan transformations of Boolean
operations (e.g., NOT-AND is equivalent to OR-NOT).

• Aliasing of circuit nodes is done to eliminate code for
simple buffers and inverters.

Any one of the Boolean operations in the output assembly 
language sequence operates on all 32 bits of the PA-RISC 
data path, as shown in Fig. 3. We make use of this parallelism 
to run 32 independent test cases in parallel. This is possible 
because the simulator always executes exactly the same 
sequence of assembly language instructions regardless of 
the test case (assuming the circuit being simulated is the 
same). This does not reduce the time to solution for a given
test case, but it does increase the effective throughput of the
simulator by 32 times. This was still very useful because our
verification test suites are divided into a vast number of fairly 
short test cases. 
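
The data path parallelism can be illustrated with a short Python sketch. It assumes a hypothetical three-input circuit and helper names (pack, unpack, evaluate) that are not from the original simulator. Python integers stand in for the 32-bit PA-RISC registers, with bit i of every word belonging to test case i.

```python
MASK = 0xFFFFFFFF  # 32 slots: bit i of every word belongs to test case i


def pack(bits):
    """Pack 32 per-slot 0/1 values into one word (slot i goes to bit i)."""
    word = 0
    for slot, bit in enumerate(bits):
        word |= (bit & 1) << slot
    return word


def unpack(word):
    """Split a word back into its 32 per-slot values."""
    return [(word >> slot) & 1 for slot in range(32)]


def evaluate(a, b, c):
    """Hypothetical circuit: out = NOT((a AND b) OR c), all slots at once."""
    t = a & b          # one AND operation serves all 32 test cases
    t = t | c          # one OR operation serves all 32 test cases
    return ~t & MASK   # NOT, masked back to 32 bits


# Give each slot a different combination of input values.
a_bits = [slot & 1 for slot in range(32)]
b_bits = [(slot >> 1) & 1 for slot in range(32)]
c_bits = [(slot >> 2) & 1 for slot in range(32)]
out_bits = unpack(evaluate(pack(a_bits), pack(b_bits), pack(c_bits)))
```

The same three operations run regardless of the data in any slot, which is why 32 unrelated test cases can share one instruction stream.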

The compiler allows the user to write C++ behavioral de- 
scriptions of blocks such as memories and register files that 
are not efficient to represent using gate-level primitives. The 
compiler automatically schedules the calls to this C++ code, 
and an API (application programming interface) gives the 
code access to the block's ports. 

Pseudorandom Testing 

We had learned from previous projects that the type of 
defects likely to escape the RTL simulation effort would
involve subtle interactions among pending instructions and
external bus events. With up to 56 instructions pending
inside the processor and a highly concurrent system bus with
multiprocessing support, it is not possible to count — much 
less fully test — all of the interactions that might occur. We 
believed that the value of handwritten test cases and test 
cases randomly expanded from templates was reaching 
diminishing returns, even with the low simulation through- 
put achievable with the RTL simulator. 

We had also learned that pseudorandom code generators 
were a very effective means of finding these kinds of de- 
fects. Such a program generates a pseudorandom sequence 
of instructions that use pseudorandom memory addresses 
and pseudorandom data patterns. However, it is important 
that the program make pseudorandom selections in a man- 
ner that considers the microarchitecture of the processor 
and the kinds of interaction defects that are likely to occur. 

Selecting memory addresses is a good example. Memory 
addresses are 64 bits wide. If they were selected truly ran- 
domly, reusing the same address within a test case would be 
an impossibly rare event. This would fail to stress important 
aspects of the machine, such as the logic that detects that a 
load is dependent on a preceding store with the same ad- 
dress. There are hundreds of selections that a generator 
makes in which the microarchitecture must be carefully 
considered. 
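
A minimal sketch of biased address selection follows. It is an assumption about how such a choice could be made, not the generator's actual algorithm: the function names, the reuse_probability knob, the size of the recent-address pool, and the 8-byte alignment are all hypothetical.

```python
import random


def pick_address(rng, recent, reuse_probability=0.5, max_recent=8):
    """Pick a 64-bit address, often reusing one that was used recently."""
    if recent and rng.random() < reuse_probability:
        return rng.choice(recent)          # provoke load/store dependencies
    addr = rng.getrandbits(64) & ~0x7      # fresh, 8-byte-aligned address
    recent.append(addr)
    if len(recent) > max_recent:
        recent.pop(0)                      # forget the oldest address
    return addr


def generate_addresses(seed, count, reuse_probability=0.5):
    """Generate a repeatable sequence of addresses for one test case."""
    rng = random.Random(seed)
    recent = []
    return [pick_address(rng, recent, reuse_probability) for _ in range(count)]


addresses = generate_addresses(seed=1, count=200)
reused = len(addresses) - len(set(addresses))
```

With the reuse probability set to zero the selection degenerates to the purely random case described above, in which repeated addresses are practically never seen.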

We chose to target the cycle-based simulation environment
for a new pseudorandom code generator. Our pseudorandom
code generator was carefully tuned for the microarchitecture
of the PA 8000 and included support for the new PA-RISC 2.0
instruction set. Hundreds of event probabilities could be
specified by a control file to provide engineering control
over the types of cases being generated.

We also chose not to port the rich set of checking software
from the RTL simulation environment to the cycle-based
simulation environment because of the effort involved and
risk that performance would be reduced. Generators such as
our pseudorandom code generator predict the final register
and memory state of the processor, and defects will generally
manifest themselves as mismatches between the simulated
and predicted final state. It is possible that an error in state
will be overwritten before the end of a test case, but a
defect won't be missed unless this happens in every test case
that hits it, which is extremely unlikely statistically. Our
experience with hardware prototype testing, in which internal
signals are unavailable and all checking must be done
through final state, also made us confident in this strategy.
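
Final-state checking of this kind reduces to comparing two maps of locations to values. The sketch below assumes a hypothetical representation in which registers and memory words are keyed by name; the function compare_final_state and the sample values are illustrative only.

```python
def compare_final_state(predicted, simulated):
    """Return a sorted list of (location, predicted, simulated) mismatches."""
    mismatches = []
    for location in sorted(set(predicted) | set(simulated)):
        want = predicted.get(location)
        got = simulated.get(location)
        if want != got:
            mismatches.append((location, want, got))
    return mismatches


# Hypothetical end-of-test state: one register differs from the prediction.
predicted = {"r1": 0x10, "r2": 0x20, "mem[0x1000]": 0xAB}
simulated = {"r1": 0x10, "r2": 0x21, "mem[0x1000]": 0xAB}
failures = compare_final_state(predicted, simulated)
passed = not failures
```

A test case passes only when the list of mismatches is empty, so no per-cycle checking code has to run during simulation.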

Cycle-Based Simulation Environment 

Fig. 4 shows the cycle-based simulation environment, which
will be described by following the life cycle of a typical test
case. The job controller controls the 32 independent
simulations that are running in the data path positions, or slots, of
the cycle-based simulation model. It starts and ends test
cases in the 32 slots independently. It is controlled by a
UNIX shell, which is driven either by a script or
interactively for debug activities.

When a slot becomes available, the controller commands
the pseudorandom code generator to generate a new test
case, occasionally first reading a new control file. The test
case is specified by the initial state of memory and the
processor's registers, and the pseudorandom code generator
specifies the initial state of the caches as well to prevent
an initial flurry of misses.

The pseudorandom code generator downloads the initial
state of the simulation into various components of the
simulated model. These include the gate-level model, behavioral
models representing caches, register files, and other regular
structures, and emulators representing bus devices such as
the memory system, I/O adapter, and third-party processors.
The model is then stepped for numerous clock cycles until a
breakpoint trigger fires to indicate the end of the test case.
The pseudorandom code generator is then commanded to
extract the relevant final state from the simulated model and
compare it with the final state that it predicted to determine
whether the test case passed.

We used a simulation farm of up to 100 desktop work- 
stations and servers for cycle-based simulation. Jobs were 
dispatched to these machines under the control of HP Task 
Broker. 1 Each job ran several thousand test cases that were 
generated using a specific pseudorandom code generator 
control file. 

Multiprocessor Testing 

Multiprocessor testing was a key focus area. We wrote
emulators for additional processors and the I/O adapter, which
share the memory bus. It was only necessary to emulate the
functionality required to initiate and respond to bus
transactions, but the emulators were accurate enough that defects
in the processor related to cache coherency would manifest
themselves as mismatches in the final memory state.

We established linkages with our pseudorandom code
generator so that the emulators would be more effective. When a
test case started, the pseudorandom code generator
downloaded control file information so that parameters such as
transaction density and reply times could be easily varied.
The pseudorandom code generator also downloaded the
memory addresses used by the test case so that the
emulators could initiate transactions that were likely to cause
interactions.

Coverage Improvement 

Improving the test coverage of our pseudorandom code
generator was an ongoing activity. The pseudorandom code
generator has hundreds of adjustable values, or knobs, in its
control file, which can be varied to focus the generated test
cases. We found that the defect rate quickly fell off when all
knobs were left at their default settings.

We used two tactics to create more effective control files.
First, we handcrafted files to stress particular functional
areas. Second, we generated files using pseudorandom
techniques from templates, each template specifying a particular
random distribution for each knob. We found with both
strategies that it was important to monitor the quality of the
files generated.
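
The template-driven tactic can be sketched as follows. The knob names, the two distribution kinds, and the function make_control_file are hypothetical and serve only to show a template assigning one random distribution to each knob.

```python
import random

# Hypothetical template: each knob names the distribution it is drawn from.
TEMPLATE = {
    "load_store_fraction": ("uniform", 0.1, 0.6),
    "trap_probability": ("uniform", 0.0, 0.02),
    "branch_fraction": ("choice", [0.05, 0.15, 0.30]),
}


def make_control_file(template, seed):
    """Draw one value per knob from the distribution the template specifies."""
    rng = random.Random(seed)  # seeding makes each control file reproducible
    knobs = {}
    for name, spec in template.items():
        kind = spec[0]
        if kind == "uniform":
            knobs[name] = rng.uniform(spec[1], spec[2])
        elif kind == "choice":
            knobs[name] = rng.choice(spec[1])
        else:
            raise ValueError(f"unknown distribution kind: {kind}")
    return knobs


control_files = [make_control_file(TEMPLATE, seed) for seed in range(5)]
```

Recording the seed with each file lets an interesting control file be regenerated later when a test case built from it exposes a defect.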

We did this in two ways. First, our pseudorandom code
generator itself reported statistics on the test cases generated
with a given control file. A good example is the frequency of
traps. Traps cause a large-scale reset inside the processor,
including flushing the instruction queues, so having too
many traps in a case effectively shortens it and reduces its
value. We made use of instrumentation like this to steer the
generation of control files.

Feedback is often needed based on events occurring within
the processor, which our pseudorandom code generator cannot
generally predict. For example, an engineer might need to
know how often the maximum number of cache misses are
pending to be confident that a certain area of logic has been
well-tested. Test case coverage analysis was accomplished
by an add-on tool in the simulation environment. This tool
included a basic language that allowed engineers to describe
events of interest using Boolean equations and timing delays.
The list of events could include those that were expected to
occur regularly or even those that a designer never expected
to occur. Both ends of this spectrum could provide useful
information.

Once the events were defined, the add-on tool provided
monitoring capabilities during the simulation. As test cases
were run, the tool would generate output every time it
detected a defined event. This output was then postprocessed
and assembled into an event database. The event database
could contain results of thousands of test case runs. Event
activity reports were then generated from this event
database. These reports included statistics such as frequency of
events, duration of events, the average, maximum, and
minimum distance between two occurrences of an event, and so
on.
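
The statistics named above are simple to derive once each occurrence of an event is stored with its start and end cycle. The sketch below assumes that hypothetical record format; the function event_report and the field names are illustrative and do not describe the actual tool.

```python
def event_report(occurrences, total_cycles):
    """Summarize one event from (start_cycle, end_cycle) pairs sorted by start."""
    starts = [start for start, _end in occurrences]
    durations = [end - start + 1 for start, end in occurrences]
    # Distance between consecutive occurrences, measured start to start.
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return {
        "count": len(occurrences),
        "per_1000_cycles": 1000 * len(occurrences) / total_cycles,
        "mean_duration": sum(durations) / len(durations) if durations else None,
        "min_gap": min(gaps) if gaps else None,
        "max_gap": max(gaps) if gaps else None,
        "mean_gap": sum(gaps) / len(gaps) if gaps else None,
    }


# Three hypothetical occurrences of one event in a 100-cycle test case.
report = event_report([(10, 12), (20, 20), (50, 53)], total_cycles=100)
```

An event whose count stays at zero across thousands of test cases points to a coverage hole, while an event that a designer never expected but that does occur deserves a closer look.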

The event activity reports were then analyzed by engineers
to identify weak spots in coverage and provide feedback to
the generation of control files. This methodology provided
one other benefit as well. For many functional defects,
especially ones that were hard to hit, the conditions required
to manifest the defect were coded and defined as an event.
Then this add-on tool was used with a model that contained
a fix for the defect to prove that the conditions required for
the defect were being generated.

Switch-Level Simulation 

In typical ASIC design methodologies, an RTL description is 
the source code for the design, and tools are used to synthe- 
size transistor-level schematics and IC layout mechanically 
from the RTL. Verifying the equivalence of the synthesized 
design and the RTL is largely a formality to guard against 
occasional tool defects. In the full-custom methodology used 
on the PA 8000, however, designers handcraft
transistor-level schematics to optimize clock rate, die area, and power
dissipation. Therefore, we needed a methodology to prove
equivalence of the handcrafted schematics and the RTL.

At the time the project was undertaken, formal verification
tools to prove this equivalence were not available. Instead,
we turned to an internally developed switch-level simulator.
Although much slower than the RTL simulator, the
switch-level simulator included essential features such as the ability
to model bidirectional transistors, variable drive strength,
and variable charge ratios. Thanks to this careful effort in
switch-level verification on the PA 8000, not a single defect
was found on silicon that was related to a difference between
the transistor-level schematics and the RTL.

Verification was performed by proving that a block behaved
the same when running a test case in the RTL simulator and
in the switch-level simulator. First, a full-chip RTL simulation
of a test case was done with the ports of a block monitored. 
These vectors were then turned into stimulus and assertions 
for a switch-level simulation of the block. Initializing the 
state of the block identically in the two environments was a 
challenge, especially since the hierarchies and signal names 
of the RTL and schematic representations can differ. 

Initially, this strategy was used to turn on the switch-level
simulator models of individual blocks on the chip. This
helped to distribute the debug effort and quickly bring
all blocks up to a reasonable quality level. Afterward, the
focus shifted to full-chip switch-level simulator verification.
In addition to collecting vectors at the ports of the chip,
thousands of internal signals were monitored in the RTL
simulation and transformed into assertions for the
switch-level simulation. These were valuable for debugging and
raising our confidence that there were no subtle behavioral
differences between the two models.

The RTL simulation effort was a plentiful source of test
cases, but they were targeted at functional defects rather
than implementation errors, and the slower speed of the
switch-level simulator allowed only a portion of them to be
run. To improve coverage, the process shown in Fig. 5 was
used at the block level. The RTL description for the block
was converted into an equivalent gate-level model using
tools developed for cycle-based simulation. Automated test
generation tools, normally used later in the project for
manufacturing, were then used to create test vectors for the
gate-level model. If the switch-level simulation using these
vectors failed, then the two representations were known
to differ. While the automated test generation tools do not
generate perfect test vectors, this process still proved to be
a valuable source of additional coverage.

The switch-level simulator also supports several different
kinds of quality checks. These include dynamic decay
checking to detect undriven nodes, drive fight checking to
detect when multiple gates are driving the same node, and a
race checking methodology. The race checking was implemented
by altering how the clock generator circuits were modeled to
create overlap between the different clocks on the chip. Failures
that arose from overlapped clocks pointed to paths requiring
detailed SPICE simulations to prove that the race could not
occur in the real circuits. Reset simulations were done from
random initial states to ensure that the chip would power up
properly. Finally, a switch-level simulator model was built
from artwork netlists to prove that there were no mismatches
between the artwork and the schematics that were missed by
other tools.

Fig. 5. Process used at the block level to improve test coverage.

Postsilicon Verification

Presilicon verification techniques are adequate to find nearly
all of the defects in a design and bring the level of quality up
to the point where the first prototypes are useful to software
partners. However, defects can be found in postsilicon
verification that eluded presilicon verification for many reasons.

First, test code can be run at hardware speeds, several orders
of magnitude faster than even the fastest simulators. Errors
in the simulation models or limitations in the simulators
themselves can cause the behavior of the silicon to differ
from that predicted by the simulators. Finally, most
simulation is targeted at system components, such as the PA 8000
itself, rather than the entire system. Errors in the
specifications for interfaces between components, or different
interpretations of these specifications, can become apparent when
all components are integrated and the system exercised.

Overlapped Test Coverage 

Previous projects had established the value of running test
code from as many sources as possible. Each test effort had
its own focus and unique value, but each also had its own
blind spots. This is even true for pseudorandom code
generators. In the design of these very complex programs, many
decisions are made that affect the character and style of
the generated code. There can be code defects as well that
cause coverage holes. The huge overlap in coverage
between efforts proved to be an invaluable safety net against
the limitations and blind spots of individual tools.

The PA 8000 verification team focused its effort on
pseudorandom code testing. Experience showed that this would
be the primary source of subtle defects and would allow us
to find most defects before our software partners. We ran
several tools, including our pseudorandom code generator
and generators used in the development of the PA 7200 and
PA 7300LC processors and the HP 9000 Model 725
workstation. Several tools were capable of generating true
multiprocessing test cases that included data sharing between
random sequences running on different processors. Data
sharing with DMA processes was implemented as well.

We developed a common test environment for most of these
random code generators. The HP-UX operating system is
not a suitable environment because its protection checks
do not permit many of the processor resources to be easily
manipulated. Our test environment allowed random testing
of privileged operations and also included many features to
improve repeatability and facilitate debugging. For example,
it performed careful initialization before each test case so
that, aided by logic analyzer traces, we could move a failing
test case to the RTL simulator for easy debugging (in
hardware, there is no access to internal signals). We established
a plug-and-play API so that the investment in the
environment could be leveraged across several generators.

In parallel with our pseudorandom testing, our software
partners pursued their own testing efforts. While primarily
targeted at their own software, this provided stress for the
processor as well. The test efforts included the HP-UX and
MPE/XL operating system kernels, I/O and network drivers,
commands, libraries, and compilers. Performance testing
also provided coverage of benchmarks and key applications.
Finally, although it did not find any defects in the PA 8000,
HP's Early Access Program made available preproduction
units to customers and external application developers.

Ongoing Improvement 

When defects were found, we used the process shown in
Fig. 6 to learn as much as possible about why the defect was
missed previously and how coverage could be improved to
find additional related defects. After the root cause of the
defect was determined, actions were taken in the areas of
design, presilicon verification, and postsilicon verification.

The designers would identify workarounds for the defect
and communicate these to our software partners, at the
same time seeking their input to evaluate the urgency for
a tape release to fix the defect. The design fix was also
determined, and inspections were done to validate the fix
and brainstorm for similar defects.

In the area of presilicon verification, reasons why the defect
was missed would be assessed. This usually turned out to be
a test case coverage problem or a blind spot in the checking
software. Models would then be built with the design fix and
other corrections. Test coverage would be enhanced in the
area of the defect, and simulations were done to prove the
fix and search for any related defects. Cycle-based
simulation played a big role here by finding several introduced
defects and incomplete fixes.

The postsilicon verification activities were similar. Coverage
for the tool that found the defect would be enhanced, either
by focusing it with control files or by tool improvements.
Spurred by a healthy rivalry, engineers who owned other
tools would frequently improve them to show that they
could hit the defect as well. All of this contributed to finding
related defects.

Performance Verification

At the time the PA 8000 was introduced in products, it was
the world's fastest available microprocessor. Careful
microarchitectural optimization and verification of the design
against performance specifications were factors in achieving
this leadership performance.

In a microarchitectural design as complex as the PA 8000,
seemingly obscure definition decisions and deviations of the
design from the specification can cause a significant loss of
performance when system-level effects and a variety of
workloads are considered. A good example is a design
defect that was found and corrected in the PA 8000. When
a cache miss occurred under certain circumstances, a dirty
cache line being evicted from the cache would be written out
on the system bus before the read request for the missing
line was issued. Since the addresses of the two lines have
similar low-order bits, both mapped to the same bank of
main memory. The memory controller would begin
processing the write as soon as it was visible on the bus, busying
the memory bank and delaying the processing of the more
critical read.

A detailed microarchitectural performance simulator was
written early in the project to help guard against such issues.
It was used to project performance and generate a statistical
profile for a variety of benchmarks and applications.
Workloads with surprising results or anomalous statistics were
targeted for more detailed analysis, and through this process
opportunities were identified to improve the
microarchitecture. Particularly valuable feedback came from the compiler
development team, who used the simulator to evaluate the
performance of compiler prototypes. The concurrent
development of tuned compilers with close cooperation between
hardware and software teams was a key contributor to the
PA 8000's performance leadership.

The microarchitectural performance simulator was written at
a somewhat abstract level, so it could not provide feedback
on whether the detailed design met the performance
specifications. Comparing the performance of the RTL simulator
against the microarchitectural simulator was the obvious
way to address this, but the RTL simulator was far too slow.
As a compromise, we performed this comparison on key
performance kernels that were tractable enough for the RTL
simulator. We also developed a path by which a workload
could be run up to a critical point using the
microarchitectural simulator, at which point the state of the memory,
caches, and processor registers could be transferred into
the RTL simulator for detailed simulation.

Performance verification continued in the postsilicon phase
of the project. The PA 8000 incorporated several performance
counters that could be configured to count numerous events.
These were used to help identify workloads or segments of
workloads needing closer analysis. The PA 8000's external
pins and debug port provided sufficient information to
determine when instructions were fetched, issued for
execution, and retired. Isolation of specific performance issues
was aided by a software tool called the depiper, which
presented a visual picture of instruction execution. Through
these efforts, several performance-related hardware defects
were identified and corrected before production.

Results

Achieving a defect-free design at first tape release is not a
realistic expectation for a design as complex as the PA 8000.
Nevertheless, we were extremely satisfied with the quality
we achieved at first tape release. The first prototypes were
capable of booting the operating system and running virtually
any application correctly. In fact, only one defect was ever
hit by an application, although a few defects were
encountered in stress testing of system software.

Fig. 7 shows the sources of defects found and corrected after
first tape release. Surprisingly, about a third of the total
defects were found by continued use of the presilicon
verification tools (mostly the cycle-based simulation environment)
for a few months following tape release. This indicates that
despite the outstanding performance of cycle-based
simulation, the project would have benefited from even more
throughput, or perhaps use of the tool earlier. A third of the
defects were also found by one of the pseudorandom code
generators running on hardware prototypes. Inspections
were a significant source of defects. The remaining defects
were split between turn-on work, performance analysis
work, and partner software testing. Since very few defects
were discovered by partners, we could generally
communicate workarounds ahead of time and take other steps to
minimize the impact.

Fig. 8 shows the impact of the defects found after first tape
release on our software partners. A large majority were never
seen outside the environment in which they were found and
had no significant impact. About half of these involved
functional areas, such as debugging hardware, that are not even
visible to applications or system software. Most of the
remaining defects had only a moderate impact. Examples are
defects that were found by a partner at the expense of their
testing resources, defects that required a workaround in
system software, and defects that required certain
performance-related features in the processor to be disabled. Only a
handful of defects were severe enough to temporarily block or
significantly disrupt a partner's development and testing
efforts. All but one of these were early multiprocessing
defects that slightly delayed bringing up the multiprocessing
operating system.

Fig. 8. Partner impact.

Cycle-Based Simulation Results

The cycle-based simulation effort made an essential
contribution to the verification of the PA 8000. Fig. 9 shows the
sources of defects that eluded our RTL simulation effort
(which incorporated existing best practices). If we had not
made the investment in cycle-based simulation, the number
of defects that would have had to be found by postsilicon
techniques would have been three times higher. It was much
less expensive to fix the defects caught by cycle-based
simulation as the design progressed than it would have been to
fix them in later revisions.

Also, because cycle-based simulation tended to find the
most severe defects early, no masking defects were present,
and the number of serious blocking defects that we had to
manage after the first tape release was reduced by three to
six times. If our software partners had been exposed to this
level of severe defects, it is probable that the product's time
to market would have been impacted.

Finally, cycle-based simulation provided a high-confidence
regression test before each tape release. Several
incomplete bug fixes and new defects that had been introduced in
the design were found in time to be corrected before a tape
release.

Conclusions 

Continuous innovation in functional verification tools and
processes is required to keep pace with the increasing
microarchitectural complexity of today's CPUs. This paper has
described the methodologies used to verify the PA 8000.
These met our most important goal of improving the quality
of the PA 8000 to the high level demanded by our customers.
By finding defects early, they also helped us conserve our
engineering resources and quickly deliver the
industry-leading performance of the PA 8000.

Fig. 9. Defects escaping traditional presilicon verification.


Electrical Verification of the
HP PA 8000 Processor



Electrical verification applies techniques from both functional verification 
and reliability and environmental testing to improve the quality of the 
CPU. Electrical verification checks that the CPU functions correctly under 
stressful environmental conditions, well outside the normal operating 
environment. 

by John W. Bockhaus, Rohit Bhatia, C. Michael Ramsey, Joseph R. Butler,
and David J. Ljung



Early in a product's design life cycle, considerable attention
is paid to its functional correctness. This functional
verification is carried out on the earliest prototypes and, especially
in the case of complex devices such as large VLSI circuits,
even earlier through simulation.

As a product approaches customer shipments, testing it
against HP's stringent reliability and environmental
specifications is a critical task.

In between these two test methodologies, there is a third
method that is becoming increasingly important. This is
postsilicon electrical verification. Electrical verification
applies techniques from both functional verification and
reliability and environmental testing to improve the quality
of a device or product.

The philosophy of electrical verification is different from the
other two methods. While it is possible, although not
necessarily common, to complete functional verification and
reliability testing without finding reasons to change a design,
electrical verification's goal is to find a design's weaknesses
and fix them. Even a very good design should come out of
electrical verification with a higher level of quality.

Like functional verification, electrical verification seeks to
exercise as many of a design's logic states, signal paths, and
state transitions as possible. However, entering a state,
driving a signal, or triggering a transition once is not enough
for electrical verification. Combining reliability testing with
the coverage of functional verification, electrical verification
repeats the functional tests under stressful conditions
beyond what a product may ever see in a real application.

Electrical verification goes beyond the limits of reliability and
environmental testing, which is typically done only at the
system level. Reliability tests are usually done independently
of each other. For example, line voltage variations are not
applied concurrently with ambient temperature tests.
Reliability tests stop at predefined limits. In contrast,
electrical verification varies many test parameters at the same
time. The ranges of those parameters are continually increased
until failures are found.



Electrical verification is not simply a random scattering of
tests executed in the hopeful search for some kind of
statistical confidence. Instead, the goal is complete coverage
of the product's operating space and beyond. This serves
two purposes. First, the design is verified over all of the
combinations of actual conditions it may encounter. Second,
failure mechanisms or critical features that may lie outside
normal operating limits can be found, identified, and possibly
fixed. These items, such as critical timing paths, charge
sharing conditions, or marginal driver strengths, can move
inside normal limits as a product ages, manufacturing
conditions change, or other unanticipated situations arise.
Removing them early in the design life cycle ensures a
reliable product with a longer life for the customer and avoids
a costly in-production change for HP. Furthermore, fixing
these electrical failures can increase the yield of the device
at a given frequency and can enable higher-frequency and
higher-performance upgrades.

An additional purpose of electrical verification is to help de- 
termine production test limits and guardbands. By including 
IC process parameters in the verification effort, test limits 
can be extrapolated to predict proper function through 
normal operating conditions. 

Fig. 1 is a flowchart of the electrical verification process.

Shmoo Plots

The hunt for electrical bugs begins with the shmoo plot. The
shmoo was a character in the Li'l Abner cartoon strip that
had a changing, bloblike shape. The shmoo plot shows
whether the device under test passed or failed as a function
of various combinations of electrical parameters applied to
it. The name has become part of the engineer's vernacular
because the region where a particular device passes or fails,
plotted against the parameters applied, may have some of
the rounded curves and shifting nature of the mythical
shmoo. Often, the shape of the plot conveys information
about the failures. Many common shapes have been given
names (see "Shmoo Plot Shapes").
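As a rough illustration of the idea, the sketch below sweeps a test over a grid of supply voltages and clock frequencies and prints a text shmoo plot. The function names, the grid values, and the toy pass/fail model are all assumptions made for this example; they are not taken from the PA 8000 tools.

```python
from typing import Callable, List


def shmoo(test: Callable[[float, float], bool],
          voltages: List[float],
          freqs_mhz: List[float]) -> List[str]:
    """Run `test` at every (voltage, frequency) point and return plot rows.

    Rows are ordered from highest voltage to lowest so the text reads like
    a conventional plot; '*' marks a pass and '.' marks a fail.
    """
    rows = []
    for v in sorted(voltages, reverse=True):
        cells = "".join("*" if test(v, f) else "." for f in sorted(freqs_mhz))
        rows.append(f"{v:4.1f} V |{cells}")
    return rows


def toy_device(voltage: float, freq_mhz: float) -> bool:
    """Hypothetical device: maximum passing frequency rises with voltage."""
    return freq_mhz <= 100.0 * voltage - 150.0


if __name__ == "__main__":
    for row in shmoo(toy_device, [3.0, 3.2, 3.4, 3.6],
                     [140, 160, 180, 200, 220]):
        print(row)
```

With this toy model the passing region widens toward higher frequencies as the voltage rises, which is the "normal" shape. A real sweep would replace `toy_device` with a routine that sets the operating point and runs test code on hardware.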
Fig. 1. The electrical verification process.

The shmoo's name has also become a verb. To shmoo is to
run a test repeatedly while varying one or more parameters.
These parameters are the axes of the shmoo plot and include
both the obvious and the nonobvious.

Obvious shmoo plot parameters include power supply voltage
and temperature. In complex systems such as those using
the PA 8000, many different supply voltages exist. Supply
voltages and temperature are clear choices for shmoo
plotting because they so directly affect the operation of
electronic devices.

The IC manufacturing process is a key shmoo plot parameter
for projects like the PA 8000. In keeping with the goal of
methodically covering the shmoo space, testing a large,
random sample of parts isn't enough. Instead, prototype
parts are manufactured with one or more process metrics
intentionally modified. Typical IC shmoo plot parameters
are transistor gate length and leakage currents.

Clock frequencies are also good shmoo plot parameters.
Increasing the frequency of clocks is a good way to find
slow signal paths. However, pushing frequencies higher only
tells some of the device's story. Useful information can also
be found by seeing how slowly it can go and by testing many
frequencies in between the maximum and minimum. Logic
races and transmission line reflections are just two potential
problems that may lurk in the low frequencies, which the
engineer often assumes are "easy."



A shmoo plot parameter that may be less obvious is the
software executed on the device under test. Good shmoo plot
code seeks to exercise as much of the device as possible.
Executing a power-up self-test or booting an operating
system may seem complicated, but they are not necessarily
good shmoo tests. The PA 8000 shmoo process used a large
number of tests that ranged from specially designed to
randomly generated.

Test Cases 

The success of any verification effort depends to a large
extent on the nature and type of test cases that are run on
the CPU. The testing code needs to be good enough so that
when the systems reach customers the CPU is bug-free. To
ensure this, the tests must provide adequate coverage of the
design features incorporated into the CPU. Furthermore,
because of the complexity of today's processors, it is
impossible to imagine all of the interactions that must occur to
cause a particular event in the processor. This complexity
necessitates the use of random testing. In the postsilicon
environment, random testing is aided by the fact that
throughput is generally not a problem (compared with the
presilicon simulations done on software models, which are
millions of times slower than running on the actual
hardware) and therefore a huge volume of random testing can
be accomplished.

The electrical verification of the HP PA 8000 CPU relied 
upon the following sources for test cases: 
• Directed handwritten tests
• Focused random tests targeted at specific CPU functions
• Random code generators
• Library of worst-case tests for previous bugs
• HP-UX application code.

In most instances, these test cases were checking for
failures in real time. In general, they would set up some initial
conditions, run the test code, and perform some checking.
Some cases checked for a specific outcome in memory or a
general register. Others compared the full architectural state
at the end of the test case to some expected final state.

For most of the sources mentioned above, the test cases
were leveraged from the presilicon verification effort
through the use of various scripts to perform modifications
for the postsilicon operating environment. Several benefits
can be realized by leveraging the work from presilicon
verification. First, leveraging all the tools results in less
development time and hence less total work for the postsilicon
verification team. Second, by sharing the tests and tools, we get
a common environment between presilicon and postsilicon
verification. This allows easier modeling of postsilicon
failures in presilicon simulation tools and makes the learning
curve easier as well. It also provides a path to go back and
forth between the two environments, which allows directed
test development for postsilicon failures.

Since presilicon verification is targeted at functional
correctness, the use of tests leveraged from that effort requires
some caution. Electrical verification tests in general need to
be much more data-pattern-sensitive than their functional
counterparts. For example, a bus may need to be driven
from all zeros to all ones or alternating ones and zeros to
adequately test for electrical failures, whereas the data
patterns in a functional test don't usually influence the logical
correctness of the design. Therefore, the random code
generators used in the electrical verification effort were modified
to provide knobs that allowed the test case writers to control
the data patterns used in these tests.
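To make the notion of a data-pattern knob concrete, here is a minimal sketch of a generator that produces successive bus values under a selectable pattern. The function name, the pattern names, and the seed parameter are hypothetical and chosen only for illustration; the article does not describe the actual knobs in the PA 8000 random code generators.

```python
import random
from typing import List


def bus_patterns(width: int, kind: str, count: int, seed: int = 0) -> List[int]:
    """Return `count` successive bus values of `width` bits.

    kind:
      "toggle"   alternate all zeros and all ones (every bit switches)
      "checker"  alternate 0101... and 1010... (neighbors switch oppositely)
      "random"   uniformly random values, reproducible via `seed`
    """
    mask = (1 << width) - 1
    if kind == "toggle":
        pair = (0, mask)
    elif kind == "checker":
        low = int("01" * width, 2) & mask   # bit pattern ...0101
        pair = (low, ~low & mask)           # and its complement ...1010
    elif kind == "random":
        rng = random.Random(seed)
        return [rng.getrandbits(width) for _ in range(count)]
    else:
        raise ValueError(f"unknown pattern kind: {kind}")
    return [pair[i % 2] for i in range(count)]
```

For example, `bus_patterns(8, "toggle", 4)` yields 0x00, 0xFF, 0x00, 0xFF, so every bit of an 8-bit bus switches on each transfer, while the "random" setting corresponds to the functional-style data that electrical tests need to go beyond.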

Our library of worst-case test code for all bugs known to
date was valuable in ensuring that no known failure
mechanisms had gone uncovered as we went through different
revisions of the chip. This library of tests was always kept
updated and used repeatedly on all CPU parts.

One final source of test cases was our HP-UX stress test
suite. Not only is this most similar to what the majority of
customers are going to run, but HP-UX testing can also offer
the most coverage. Usually this is used as the last set of
testing to make sure that all of our test coverage so far has been
close enough to the stress that HP-UX code puts the PA 8000
through. The stress test suite consists of a number of
applications including the SPEC benchmarks and scripts that make
sure that everything is running correctly. The drawbacks
are that the run time for HP-UX stress tests is a few hours,
which is orders of magnitude longer than the other test
types, and that these tests are much harder to debug if
failures occur.

Automated Tools 

To save time, we created a number of tools that help us
automate many of the verification tasks. Each of our test systems
was connected to a controller system (see Fig. 2). We then
replaced the boot ROMs on the test system with ROM
emulators. Our controller system controlled the voltages,
frequencies, and temperature and the code that was run on the
test system, and it also monitored all of the I/O to and from
the test system through the RS-232 port.

We needed to have complete control over the code that was
run on the test system. Instead of using the boot ROM to
initialize the system so that we could run a full operating
system, we used our own framework code called the
CharROM, short for characterization ROM (see Fig. 3). The
CharROM installed a subset of the normal initialization code
and then ran test code for us. All of our tests were compiled
into a separate ROM called a test ROM, which could be
uploaded without changing the CharROM. Each test ROM
contained the test code as well as information on how to run
each test, such as verbosity settings and number of iterations.
It also contained information about the test that was printed
out so that our tools could save information on what tests
we were running. With our systems set up this way, we could
turn on power and within a few seconds the test system
would be printing out initialization information and then test
codes.

Fig. 2. Electrical verification test system setup.

The CharROM had the flexibility to run tests in batch mode,
one at a time according to the test ROM, or interactively for
debugging. In interactive mode the CharROM let us run tests
and modify and view the state of the chip.

The big advantage of the CharROM was that it booted quickly
and let us change tests rapidly. This saved a great deal of
time during debugging when we needed to run many code
experiments. It also eliminated a lot of the boot code that
would be needed to run something like the HP-UX operating
system. This meant that we were less likely to hit a failure
while booting and if we did we could often make quick
changes in the CharROM to avoid the failure. This was most
obvious when we discovered an electrical bug in a branch
instruction that was keeping us from booting. We were able
to rewrite the CharROM framework in two days without
using the failing types of branches, something that we could
never have done with a full-fledged boot ROM.

The most important feature of the CharROM was the control
it gave us over exactly what code was being run on the
Fig. 3. The CharROM makes it possible to boot quickly and change
tests rapidly. Boot starts at the beginning of the CharROM. arch_exec
and the test ROM are copied to memory for speed reasons. Basic
assembly tests are run either in ROM or in memory. Modified phase-1
format tests are unpacked into available memory by arch_exec, which
runs and checks them.
Shmoo Plot Shapes



A shmoo plot is a graph that represents how a particular test passes or
fails when parameters like frequency, voltage, or temperature are varied
and the test is executed repeatedly. The shape of the failing region is
meaningful and helps in determining the cause of the failure. Shmoo
plots typically fall into familiar categories with descriptive names.
A shmoo plot of normal circuit operation shows better high-frequency
performance as supply voltage increases, as shown in Fig. 1a. However,
other shapes frequently seen include the curlback (Fig. 1b), ceiling
(Fig. 1c), floor (Fig. 1d), wall (Fig. 1e), finger (Fig. 1f), and breaking wave
(Fig. 1g).
system. Running tests in an environment like HP-UX leaves
the tester at the mercy of the operating system, which can
switch processes in and out and limit access to privileged
operations.

One of the important responsibilities of the CharROM was
to initialize the chip state between tests. This helped control
repeatability for a given test (so we weren't dependent on
a random chip state) and also helped avoid dependencies
between tests, that is, a previous test affecting the run of a
future test.

Inside the CharROM we had a subframework called arch_exec
that could run our modified presilicon test cases. arch_exec
takes apart the initial state and sets up the chip accordingly.
After the test is run, arch_exec compares the chip state to the
expected final state information in the test, automatically
showing us any failures. This let us deal with many tests in
bulk.
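The final-state comparison described above can be sketched in a few lines. This is only an illustration under assumed names: the register-dictionary representation and the function `compare_final_state` are hypothetical and do not reflect how arch_exec was actually implemented.

```python
from typing import Dict, List, Tuple

State = Dict[str, int]  # register name -> value (assumed representation)


def compare_final_state(expected: State,
                        actual: State) -> List[Tuple[str, int, int]]:
    """Return (name, expected, actual) for every register that mismatches.

    Only registers named in `expected` are checked, so a test can leave
    don't-care registers out of its expected final state. Every expected
    register is assumed to be present in `actual`.
    """
    return [(name, want, actual[name])
            for name, want in sorted(expected.items())
            if actual[name] != want]
```

An empty result means the case passed. A non-empty result is the raw material for a failure signature, since it names exactly which registers differ and how.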

To run our shmoo tests we had scripts that would boot the
systems at each point in the shmoo test domain and read
the output to decide if the tests had passed or failed. After
running a shmoo test or even during a shmoo test we could
analyze the output to ignore certain failures or focus on
specific failures.

At the completion of a shmoo test, the shmoo script stored
all the output in our shmoo database. We used the database
to look for specific tests or specific parts so we could avoid
duplicating work. This also turned out to be very useful
when we needed to take another look at past bugs; we still
had all of the shmoo information for the bug work.
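The article does not say how the shmoo database was built. Purely as an illustration of the kind of lookups described, the sketch below uses Python's standard sqlite3 module with an invented table layout; the schema, the function names, and the part and test labels are all assumptions.

```python
import sqlite3
from typing import List, Tuple


def open_shmoo_db() -> sqlite3.Connection:
    """Create an in-memory database with one row per shmoo point."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE shmoo (part TEXT, test TEXT, voltage REAL, "
               "freq_mhz REAL, passed INTEGER)")
    return db


def record_point(db: sqlite3.Connection, part: str, test: str,
                 voltage: float, freq_mhz: float, passed: bool) -> None:
    """Store the result of one test at one operating point."""
    db.execute("INSERT INTO shmoo VALUES (?, ?, ?, ?, ?)",
               (part, test, voltage, freq_mhz, int(passed)))


def already_run(db: sqlite3.Connection, part: str, test: str) -> bool:
    """True if this part has any stored results for this test."""
    row = db.execute("SELECT COUNT(*) FROM shmoo WHERE part = ? AND test = ?",
                     (part, test)).fetchone()
    return row[0] > 0


def failing_points(db: sqlite3.Connection,
                   test: str) -> List[Tuple[str, float, float]]:
    """All (part, voltage, freq_mhz) points at which `test` failed."""
    return db.execute("SELECT part, voltage, freq_mhz FROM shmoo "
                      "WHERE test = ? AND passed = 0 "
                      "ORDER BY part, voltage, freq_mhz", (test,)).fetchall()
```

A query like `already_run` is what lets a script skip work that has been done before, and `failing_points` is the sort of lookup that helps when an old bug has to be revisited.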

Failure Identification 

The process of electrical verification of a CPU begins with
the task of identifying electrical failures. An electrical failure
can be described as the malfunction of a chip under certain
but not all operating environments. If the failure occurs
regardless of the operating environment, then it is termed a
functional failure and is not covered here. Typically,
electrical failures can be traced to some electrical phenomenon
that occurs only under certain operating environments.
Some examples include latch setup or hold time violations,
noise issues, charge sharing, leakage issues, and cross talk.

The task of identifying new failure mechanisms involves a
number of steps. First, test code must be selected and run in
a variety of operating environments. The data collected from
this is displayed graphically in shmoo plots and anomalies
are noted. Next, the anomalies are checked for repeatability.
If the anomaly repeats reliably, the failure signature is
analyzed in an attempt to classify the failure or narrow it down
to a particular area of the CPU, if possible. The sensitivities
to different operating environment variables are also
determined to gain further understanding of the failure mode
before it is moved into the debugging stage. Each of these
steps will be discussed in more detail.
Searching For an Anomaly

Electrical verification of the PA 8000 CPU encompasses a
huge test space that is impossible to cover in a reasonable
amount of time. This is partly because of the increasing
complexity of devices and partly because of the large
number of operating environment variables. Operating variables
include test code, ambient temperature, several supply
voltages, frequency and bus ratios, types of CPU chips (i.e.,
variations in the fabrication process), different speed grades of
cache SRAMs, and many others. A number of techniques
were applied in the electrical verification of the PA 8000
CPU that, taken together, effectively covered this large test
space.

Initially, the emphasis is placed on varying a large number of
the operating variables and exercising the CPU with simple
test code. A variety of CPU parts from different corners of
the fabrication process are deliberately selected and run
under various combinations of temperature and supply
voltages to look for failures. For example, known fast PA 8000
CPUs were run in a cold chamber at high supply voltages to
look for one class of failures. Similarly, a set of slow CPUs
were run in a hot chamber at low supply voltages to look for
another class of failures. Experience is always a good guide
to the operating variable combinations that are likely to yield
failures.

Stress testing is another technique that is applied to induce
failures. Stress testing refers to running the CPU with test
code on the fringes of the operating environments, under
conditions to which an actual system in the field may never
be subjected. However, a failure induced in this fashion can
often be moved into an operating region that we care about,
simply by further experimentation and analysis.

As the process of electrical verification proceeds, the
emphasis shifts from running simple test code at a variety of
operating points to more complex code sequences at fewer
operating points. This can be compared to changing the
exploration of the test space from a breadth-first search to a
depth-first search. The more complex code sequences are
derived from running several random code sequence generators,
pseudorandom focused tests, directed tests, and HP-UX
application code.

Using one or more of the techniques outlined above, test
data is gathered and can be viewed in shmoo plots. These
plots are examined and compared to what has been
observed in the past for previous test runs on earlier silicon
revisions. If the shmoo plots reveal regions of failure not
observed before, then we have a shmoo anomaly, which
needs to be pursued further.
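One simple way to picture the comparison against earlier runs is as a set difference between two pass/fail maps. The sketch below assumes each shmoo run is stored as a dictionary keyed by operating point; that representation and the name `new_failures` are assumptions made for the example, not a description of the tools actually used.

```python
from typing import Dict, Set, Tuple

Point = Tuple[float, float]   # (supply voltage, frequency in MHz)
Shmoo = Dict[Point, bool]     # True = pass, False = fail


def new_failures(baseline: Shmoo, current: Shmoo) -> Set[Point]:
    """Points that fail in `current` but passed in `baseline`.

    Points missing from the baseline are ignored, because there is no
    earlier observation to compare them against.
    """
    return {point for point, passed in current.items()
            if not passed and baseline.get(point) is True}
```

A non-empty result marks a candidate shmoo anomaly. It still has to be checked for repeatability before it is treated as a real failure.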

Verifying Repeatability 

Once a shmoo anomaly is identified, a number of steps need
to be taken to validate and confirm it. It is important that the
anomaly be reliably repeatable and be traceable to a CPU
malfunction. The steps outlined below are used to satisfy
the repeatability requirement.

1. The failing code sequence is rerun several times on the
same CPU to confirm failure. This is done to rule out the
possibility that an inadvertent change in the operating
environment may have induced the failure.



2. In a system verification environment, several other
components must be removed from suspicion before the
anomalous behavior can be attributed to the CPU. The failing CPU
can be placed in a completely different system and the failing
code sequence rerun under similar operating conditions to
repeat the failure. The failing CPU can also be used in several
different processor boards to rule out any dependencies on
the processor board characteristics.

3. The next step is to try to locate the failure mode on
different but similarly fabricated CPU parts. These could be parts
from the same wafer or with similar process characteristics.
If the failure mode is not observed on any other CPU, then it
is generally considered to be a test escape from the wafer
and package screens, meaning there is a defect on this chip
that the wafer and package screens did not find. In other
words, we have not found an inherent problem with any
circuits on the chip. In that case, we will investigate whether
we have a coverage hole in our wafer or package screens,
rather than move this failure into the debugging phase.

4. If the above three steps are satisfied, then the failure mode
is checked for sensitivities to different operating environment
variables. Most electrical failures are modulated by one or
more of the operating variables, whether it be temperature,
supply voltage, or delays on key system clocks. This can not
only expand the failure region, but also provide some clues
to the type of failure, which can be extremely useful
information for the task of debugging.

5. Throughout the process of verifying the repeatability of
the failure mode, it is also important to watch the failure
signature and check it for consistency. That is, one must
ensure that each rerun of the test code is producing the
same failure mode.
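Steps 1 and 5 above, rerunning the failing sequence and watching the signature for consistency, can be sketched as a small helper. The callable `run_test`, the use of a string as the failure signature, and the default rerun count are all hypothetical choices for this illustration.

```python
from collections import Counter
from typing import Callable, Optional, Tuple


def check_repeatability(run_test: Callable[[], Optional[str]],
                        reruns: int = 10) -> Tuple[float, bool]:
    """Rerun a failing test and summarize how consistently it fails.

    `run_test` returns None on a pass or a failure-signature string on a
    fail. The result is (failure rate, consistent), where `consistent` is
    True when no two failures produced different signatures.
    """
    signatures: Counter = Counter()
    for _ in range(reruns):
        signature = run_test()
        if signature is not None:
            signatures[signature] += 1
    fails = sum(signatures.values())
    return fails / reruns, len(signatures) <= 1
```

A high failure rate with a single signature suggests one repeatable failure mode. Several different signatures suggest that more than one mechanism, or an unstable operating environment, is involved.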

Classifying the Failure Mode 

After a shmoo anomaly is identified and has passed the
repeatability requirement, it is time to classify and list the
characteristics of the failure mode to see if it is unique or is
one that has been observed before. Either classification is
important. If it is new, then it needs to be debugged fully and
its root cause determined. On the other hand, if it is a repeat
failure mode, then this failing code sequence needs to be
compared with the current known worst case for that failure
mode and understood as well.

Failing code sequences can come from a variety of sources.
Typically, each failing sequence will have a certain failure
signature and much can be learned from it. Here, we will
discuss three types of failures.

The first type of failure comes from self-checking code. 
A typical example might be code written to exercise and 
walk known patterns through the cache SRAMs. Such code 
will check its results and the failure messages will be self- 
explanatory. 

The second type of failure is a final-state error, generally
produced by random code generators. Random code generators
produce tests that consist of initial CPU state, a sequence
of assembly instructions, and an expected final state. When
a test terminates, the final state is checked against the
expected final state and discrepancies are noted. By analyzing
the final-state error messages and looking at the test code
sequence, one can infer quite a bit about the nature of the
failure and come up with a set of experiments to further
zero in on the failure.

The third type of failure is one in the framework code.
Framework code is code written to allow tools such as
random code generators to run on the hardware. The
framework provides the environment for initializing memory,
caches, and the architected state of the CPU. Sometimes,
especially early in the project, failures will occur in running
the framework code. In general, these are hard to debug
since the failing code sequence (the framework) could be
thousands or millions of instructions long.

The characteristics of the failure mode are determined by
noting the sensitivities to different operating environment
variables from the repeatability experiments above or
through additional experiments at this stage. It is important
to do some amount of debugging and failure characteristic
determination to rule out most known failure modes to date.

To summarize, the task of bug identification is complete
when we have accomplished all of the above and have made
a reasonable effort to rule out known problems. We now
have a new bug that is ready to be taken through the next
task, debugging.

Debugging 

The goal of the debugging effort is to determine the root
cause of the failure and fix it on the chip in a new revision.
The main steps to achieve this goal are gathering data about
the failure, expanding the failure region, and hypothesizing
the cause of the failure. These steps are all part of an
iterative process that can lead us to our goal of complete
understanding. As more data is gathered about the failure, a more
complete and accurate hypothesis can be formed and a
more accurate worst-case vector can be determined. On the
other hand, new information may also prove that our initial
hypothesis was incorrect. In that case, we go back to the
data gathering step to acquire more information about the
failure. When all of our data is consistent with our
hypothesis, we have the root cause of the failure.

Gathering Data 

Once we have determined from the bug identification
process that the failure is one that we have not seen before, we
need to gather as much data as possible about this new
failure. If multiple chips fail in the same way, we may be able to
correlate the failure with a specific wafer or lot or to a
specific speed grade of the chip. For example, it is possible that
only chips with extremely slow FETs will fail. Checking
which revisions of the chip fail can tell us whether the
failure is related to a recent change in the chip, or whether it
has always been there.

The next step is to shorten the instruction sequence that will
cause the failure. Often, failures occur in sequences of over
100 instructions, but the failure itself usually requires only
a few specific data patterns and some specific instruction
timing. Determining where in the instruction sequence the
failure is occurring is one step toward isolating the failure.

Occasionally, the failing instruction is easy to find. For
example, if only one instruction in the case modified the failing
register and the inputs to that one instruction did not change
during the case, the failure has been isolated to the
instruction sequence around that instruction. Usually, it is more
difficult. If the failure causes an unrecoverable trap or reads
bad data from the cache or main memory, we need more
information before deciding where the failure occurred.
For example, if a load from cache reads the wrong data, is
it because another instruction stored bad data, or were the
address lines incorrect during the read, or did the CPU
corrupt the data after it was read?

Failures in the framework of a random code generator are
quite difficult to debug, especially if the framework is written
in a high-level language such as C++. If the failure can be
localized to a specific code sequence, which could be quite
long, it can usually be ported to a standalone case (no
framework involved). From there, the same steps are taken
as with any failure sequence to shorten the test case while
maintaining the same failure mode.

Monitoring external events may be useful. Logic analyzers
attached to external pins, such as the system bus interface
and the cache interface, provide a picture of what the CPU
is doing when the failure occurs and can help narrow down
where in the code it is failing. We may see instruction fetches
from main memory, which can tell us what area of the code
the CPU is executing. Since the logic analyzer stores many,
many previous states, we can look back through the
execution of the case to see when bad data starts to appear. If the
failure relates to an off-chip path, oscilloscopes can be used
to verify the signal integrity of suspect paths. We have used
this method when debugging noise-related problems and
failures caused by imprecise impedance matching.

We would like to narrow down the code sequence to a very
short sequence of instructions that will still fail in the same
way as the original case. Creating a very short case such as
this is not easy, especially considering that the PA 8000 uses
out-of-order execution. Changing the original test case in
any way may change the timing of certain events in the case
such that it may not fail anymore. In gathering information
about the failure, it is important to determine what events
contribute directly to the failure and in what sequence the
events must occur for the failure to occur. Just removing
instructions starting at the beginning of the case may not
help. Suppose that a load instruction, which normally would
have caused a cache miss and a request to main memory, is
removed from the beginning of the case. The behavior of the
case will change because the next memory operation that
accesses that cache line will cause the cache miss instead.
This change in timing may cause two events in the CPU that
were concurrent in the original case to be separated by many
states in the modified case. Removing instructions may also
have the effect of changing register data patterns that may
have been required for the failure. If an add instruction that
sets up a 0x0F0F0F0F data pattern in a register is removed,
that register will contain its initial value instead, different
from the pattern set up by the add, and the failure may not
occur.

Experiments with the failing code are still very important.
Removing irrelevant instructions and data can narrow the
search for the failure. It is possible that a large number of
instructions in the failing sequence can be removed without
affecting the failure. The data patterns in the source registers
for one or more instructions might be changed without
affecting the failure. Each of these changes narrows the
search for the failure mechanism. Very slight changes in the
code sequence or data patterns can provide information on
what events are necessary for the failure. If we change only
one bit in one data pattern and the failure goes away, that is
a big indication that the failure requires one bit to be set a
certain way. Another useful step to narrow our search is to
determine if any specific CPU features are involved in the
failure. For example, if we can turn off a specific CPU
feature, such as bypassing a register value from a pipeline
stage, and the case now passes, we might say the failure is
occurring in the bypass logic, or at least the timing of the
case requires a bypass.
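The code experiments described here were done by hand and guided by engineering judgment. As a loose illustration of the underlying idea, the sketch below uses a generic greedy reduction: it tries removing one instruction at a time and keeps the removal only if the case still fails. This is a swapped-in, simplified technique, not the authors' procedure, and the names `shrink_case` and `still_fails` are hypothetical.

```python
from typing import Callable, List, Sequence


def shrink_case(instructions: Sequence[str],
                still_fails: Callable[[List[str]], bool]) -> List[str]:
    """Greedily drop instructions while the case keeps failing.

    Each candidate removal is kept only if `still_fails` reports that the
    shortened case fails, so instructions that set up required data
    patterns or timing survive. This is a single pass and does not
    guarantee the shortest possible case.
    """
    case = list(instructions)
    i = 0
    while i < len(case):
        trial = case[:i] + case[i + 1:]
        if still_fails(trial):
            case = trial      # instruction was irrelevant; retry same index
        else:
            i += 1            # instruction is needed; move to the next one
    return case
```

On real hardware `still_fails` would rerun the shortened case at the failing operating point and compare failure signatures. As the text notes, removing an instruction can shift timing enough to hide the failure, so a reduction like this can stall well before the case is short.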

While we are doing the code experiments, we may use some
of the on-chip test and debugging circuitry to get a better
picture of the failure. By running the chip in both a passing
region and a failing region and comparing the two runs, we
can get a picture of where the failure starts.

Using all of the data gathered, we can begin to see the
overall picture of the failure. We know under what conditions the
failure will occur, including frequency, voltage, temperature,
and process parameters. We know a short code sequence
that will fail, and what CPU features and timing affect the
failure. We have a partial picture of the internal state of the
failure by comparing passing and failing runs.

At the same time that we are gathering data about the failure,
we are developing a new test case for this failure. This new
case will use the known requirements for the failure to
occur, including specific instruction sequences and timing,
specific register values, and cache hits and misses. When
our new case fails, we have all the elements of the failure.

Expanding the Failure Region 

In most instances, the particular failure that occurs in a random
instruction sequence with random data patterns is not
the worst-case failure. We would like to know how severe
the problem is. One of our goals is to find the worst-case
vector. Failures that were previously outside the operating
region usually move into or very close to the operating region
with a worse vector. For example, a speed failure may
occur at a significantly lower frequency when a worst-case
data pattern is used. Or maybe a failure that only occurred
at 40°C will now occur at 20°C. Expanding the failure into
more general shmoo conditions also helps in gathering more
data. (It's not much fun to probe signals in an oven.)

Electrical problems are often heavily influenced by data
patterns. For example, driving different data patterns on an
internal bus may increase or decrease the capacitive coupling
or delay of a signal that is causing the failure. It is unlikely
that the random data pattern used in the original failing case
is the absolute worst for this particular failure. Complicating
the matter, it may not be clear what other signals could be
affecting the failing signal.

There are some cases in which the failing signal may not be
appreciably affected by any other signals. In other cases, the
failing signal, which might be part of a bus, may be influenced
by certain data patterns on that bus. For instance, some of
the signals that make up the bus may capacitively couple to
the failing signal in that same bus, slowing down the failing
signal or inverting its value. Occasionally, the biggest influence
on the failing signal is a bus that is functionally unrelated
to the failing signal, but is in close proximity physically.
The same type of capacitive coupling can occur in this case.

Changing obvious data patterns (instructions themselves,
operands from registers, and data results from the ALU or
the cache) is the first step. If none of these seem to affect
the failure, the layout can be consulted to see what buses or
signals run adjacent to or on top of the victim signal. Finding
one or more data patterns that influence the failure also
allows a better understanding of the failure.

Hypothesizing the Cause 

In hypothesizing the cause of the failure, all of the information
that has been acquired about the failure will be used.

The minimum code sequence is especially useful to the circuit
debugger. First, it provides a list of code sensitivities
that either turn the failure on or off or expand the failing
region. Second, this code can be simulated by a switch-level
simulator to give the debugger full observability into the
on-chip circuits being exercised by the test case. By comparing
simulations with and without the code sensitivities, the
exact effect of the code on the circuits is observed.

The switch-level simulator is the first tool used by the circuit
debugger. The exercised circuitry is compared with
the internal state differences from the passing and failing
data captures to narrow down the circuits involved. This
process yields several potential code experiments that will
continue to narrow the playing field. At this stage in the debugging
cycle, the circuit debugger is working side-by-side
with the system debugger to isolate the failure.

Eventually, the circuit debugger makes a root-cause hypothesis
as to the cause of the failing behavior. This hypothesis
can frequently be supported by the switch-level simulator.
For example, if the hypothesis is that a certain latch fails to
make setup, this latch can be forced to fail in the simulator.
The resulting simulated failure mode should match the actual
failure mode. This is the point when the circuit debugger
moves to SPICE as the main debugging tool.

SPICE is used to simulate the isolated failing circuitry in the
appropriate failing conditions. In the latch example, the
clock and data paths into the latch are accurately modeled
in an attempt to reproduce the failure in SPICE under the
same conditions as on real silicon. Differences are assumed
to be either inaccuracies in the modeling or mistakes in the
root-cause hypothesis. Obviously, these differences need to
be explained before root cause is declared.

If the failure is frequency dependent, another way to validate
the hypothesis is to stretch the specific clock phase during
which we believe the failure occurs. By stretching a clock
phase, we provide more time for the CPU to do the required
work of that phase. For instance, if we have a speed failure
related to one specific phase and we lengthen that one phase
by 10%, the failure should get better.

We continue to gather data and hypothesize causes of the
failure until our hypothesis passes the root-cause test.

Declaring the Root Cause

Often, by inspection of the failing case in the switch-level
simulator or SPICE, sensitivities to local circuit behavior are 
predicted. This prediction can then be verified by targeting 
code to induce this behavior. For example, we may try to 
precondition a particular bus with a specific data pattern 
such that it no longer transitions as in the failing case. Or we 
may change the instruction timing of the case such that two 
events no longer occur in the same cycle. If we can turn the 
failure on and off (the light-switch test) by changing one of 
the known sensitivities, we have a good understanding of 
the failure. 

Throughout the debugging cycle, all facts and observations
are documented thoroughly in a bug database. This database
is used to drive experiments to fill in data where it is missing
or to explain observations. The entire weight of this data
is compared to the root-cause hypothesis for consistency.
Any data point in conflict needs to be explained before the
root cause of the bug is considered known. This diligence
to the data has avoided many premature and wrong root-cause
analyses.

Once we have demonstrated a light-switch test and our hy- 
pothesis agrees with all of the data, we are at the root cause. 
We believe we fully understand the failure.

We then revisit the worst-case vector analysis one final time.
Even if our final worst-case vector does not move the failure
into the operating region, we may fix the problem anyway,
because it is hard to determine if a small process shift later
in this product's life could move this failure close to or into
the operating region.

The next step is to fix the problem. If we can fix the problem
with only a change in the metal layers, the turnaround time
for new chips is much shorter. To verify that the proposed fix
will actually eliminate the failure, we may do a FIB (focused
ion beam) experiment, in which one of the existing chips is
modified (metal lines are cut and new ones are deposited)
to include the change. The chip is characterized before
and after the FIB change to determine how the failure was
affected. If the failure was eliminated, we have good confidence
in our fix, and we will put our fix into the plan for the
next CPU revision.

Creating the Golden ROM 

After a bug has been closed, its worst-case test is added to
our ROM of all other worst-case tests that have exposed
bugs. We call this ROM the golden ROM. We use the golden
ROM for much of our volume shmoo testing, to serve a number
of purposes. It shows where the current bugs were found
and can show how a fix or certain chip characteristics could
affect these failures. It also lets us know if a bug has been
reintroduced, which happens on occasion. As the golden
ROM grows in size, it naturally gives us more coverage.
Many of our new bugs are found by running the old bug
code in our golden ROM. If a test case has important coverage
that we do not have in our tester screens, golden ROM
tests can be converted into broadside vector tests for our
package screens.

Updating the Methodologies

When the root cause or problem circuit has been identified,
it often uncovers a flaw in our design methodologies. This
implies that other similar circuits may be used on other parts
of the chip but haven't been discovered yet. The aphorism
"If there's one rat, there are many rats" becomes our motto.
A "many rats" investigation is launched to find a tool-based
method of extracting similar circuits from the chip database
and fixing them if appropriate. Quite often, the failing circuitry
has a unique topology that can be searched for with a
tool. Finally, this flaw in the design methodologies is documented
and the methodologies are updated.

Conclusion 

We continue the electrical verification process until we have
searched our matrix of variables (temperature, frequency,
voltage, process, and test cases) and we can find no more
failures that we believe could move into the operating region.
This process spans multiple chip revisions, with each new
revision fixing one or more failure mechanisms. This process
ensures the long-term quality of the product throughout its
lifespan.

In addition, we analyze the problems that we found and integrate
the solutions to these problems into our design methodologies
so that future products can avoid the same pitfalls
and potentially reach high quality levels more quickly than
previous products.

Solving IC Interconnect Routing for
an Advanced PA-RISC Processor



This paper discusses some important new block routing technologies that 
were required for the HP PA 8000 processor chip. These technologies are 
implemented in a new block routing system called PA_Route. 

by James C. Fong, Hoi-Kuen Chan, and Martin D. Kruckenberg 



The design complexities of today's microprocessors have 
grown significantly, with the number of transistors climbing 
to well over a million, silicon die sizes larger than 1.6 cm²,
clock speeds exceeding 150 MHz, and short design cycles 
caused by competition. These issues create tremendous 
pressure on design teams and the tools they use. The PA 8000
CPU design team used powerful design automation tools to 
achieve their design goals. 

Layout of the interconnect metal on the chip is one of the 
key components of advanced designs. It is vital to address 
the increasingly complex layout problem to achieve smaller 
die sizes, higher performance, and quicker time to market. 
Since the early 1980s, HP has been working to solve the top- 
level IC interconnect problems associated with many of the 
larger HP-designed and HP-manufactured ICs. This paper 
will discuss some important new block routing technologies 
that were required to implement the HP PA 8000 micropro- 
cessor. These technologies are embodied in a new in-house 
block routing system called PA_Route.

Buy or Build Decision 

Frequently we at the HP Integrated Circuit Business Division 
(ICBD) are approached by HP design teams who are about
to embark on the design of a new chip to be manufactured 
by ICBD. We are asked to enhance our routing technology 
to address issues critical to the chip's successful routing. 



This was the case when the PA 8000 design team approached
us with some ideas for new features needed to
take advantage of new technologies. Like most aggressive
designs, they were pushing the limits of every technology
where they thought they could get a significant return on
investment. Block routing was one area they thought they
could improve.

Our existing block router is called HARP (Hewlett-Packard
Automatic Routing and Placement). HARP had been evolving
for over a decade and had some legacy code that was
becoming difficult to extend.

Using customer surveys to complement our own knowledge,
we did a detailed analysis of various existing block routers.
We wanted to see how they address this new class of block
routing problems. Design teams are hesitant to switch
layout tools unless alternatives can be found that match
their design requirements well enough to justify the risks of
switching tools and the cost of the new tool, not only the
capital cost but also the cost of learning how to use the tool
effectively. Using radar charts (see Fig. 1), we were able to
determine that the less aggressive style of chips represented
by the PA 7100LC processor was well-suited for the existing
HARP system. The more aggressive style of chips represented
by the PA 8000, however, did not map to any existing block
router offerings.



Fig. 1. Radar charts showing (a) the capabilities of HP's existing HARP block router and third-party routers and
(b) the needs of less aggressive chips like the HP PA 7100LC and more aggressive chips like the HP PA 8000.

Significant changes in functionality and features usually
entail a high amount of risk. On any design it is critical to
manage the risk. Being an internal supplier, we are able to
work much more closely with our customers by giving them
greater visibility and control of the risks involved. This level
of access is generally not available when dealing with third-party
tool providers.

ICBD, as HP's internal chip supplier, is in the business of
making and selling chips, not tools, so we face stricter
requirements to justify any internal tool development. It
is rarely cost-effective to build a router for a single chip.
However, it has been our experience that the microprocessor
chips have always pushed the limits of the technologies
and the more general ASIC chips follow later after the
bumps have been smoothed out. In looking at what was special
about the PA 8000, we spotted several new technology
trends that radically changed the block routing problem and
might be adopted by future ASIC chips.

First, fabrication processes are starting to add many more
layers of metal for interconnect. This change begins to invalidate
the basic model used by traditional block routers of
separate routing channels and blocks. Second, the need for
higher off-chip connectivity is forcing a change in packaging
technology. Solder bump packaging looks like the most viable
means of addressing that need. Solder bump packaging is
also being looked at for reducing packaging cost by mounting
chips directly to boards. However, having solder bump pads
in the middle of the chip breaks the traditional block router
model of placing the pads at the periphery of the chip. Last
but by no means least is a general trend of wiring delays
becoming more significant than gate delays. Thus, the
emphasis of routing is switched from minimizing chip area
to minimizing interconnect delay.

Working with our PA 8000 customers, we prioritized the
features and came up with a manageable subset needed
for them to be successful. We then circulated a proposal to
build the PA_Route block routing system. Given the time constraints
and the ambitious goals, we had to take the drastic
step of freezing the old system, HARP, with minimal support.
We got agreement from all parties on the basis of strong
support from the PA 8000 developers.

New Technologies Lead to New Constraints

The PA 8000 design team had decided that to be competitive,
the PA 8000 chip would not only be more aggressive in its
design, using superscalar, out-of-order instructions, but
would also use a new process and new packaging. It is not
uncommon for a microprocessor to use a new process, but
this time they were moving from a three-metal-layer process
to a five-metal-layer process. In addition, the increased I/O
requirements of the design ruled out conventional packaging.
The only mature packaging approach available was solder
bump technology. With solder bump technology, the I/O
pads are spread across the whole chip and are not just
restricted to the periphery.

Analysis of the possibilities for extending or adapting the
existing block router in the HARP system showed that its
basic design intentions were not well-matched with the new
requirements and could not be changed to make full use of
the new technologies. HARP was based on the traditional
channel routing paradigm, in which there are expandable
routing channels between solid blocks. The channel router
was limited to routing in three metal layers, while the new
process had five. The solder bump I/O pads could be anywhere,
but the block router could only attach to ports on the
edges of a block.

Another more important restriction was that the solder bump
port frame could not change and the blocks could not move
with respect to the pads. There were two reasons for this.
First, the board to which the PA 8000 was to be connected
had a relatively long lead time, so its designers could not
wait for the chip to be completed before getting started.
Second, the solder bumps used to connect to the I/O pads
emit alpha particles. If the placement were changed so that
the pads became coincident with sensitive circuitry, then
unpredictable circuit behavior could result.

These requirements meant the block router could not grow
the channels or move the blocks, a constraint for which our
existing block router had only weak support. We jointly decided
it was not feasible to attempt to automate the routing
of the fifth layer of metal, since it was overly complicated by
the requirements of the solder bump I/O pads.

In PA_Route we addressed as many of the new requirements
as we could in the time available. We worked with the
PA 8000 design team to pick the most important issues to
address. This came down to two major features: being able
to use the third and fourth metal layer resources over the
top of some of the blocks and being better able to control
the growth of the placement.

Our team is not often given the time to redesign our system,
so we took advantage of the opportunity to add some long-desired
capabilities. We added a more sophisticated port
and net model, which we call foliage. With foliage we can
describe the electrical characteristics of a port and a net.
Pieces of artwork representing a port, for example, can be
considered electrically equivalent (allowing stitching), electrically
resistive (allowing connection to one of many but
without stitching), or electrically open (specifying that all
pieces of artwork must be connected). Foliage allows the
router to be more flexible in using ports, since it uses this
electrical model of the ports and it allows for more complex
routing of nets in a channel. We also took the time to use
more advanced software development techniques. We
switched our design style from structured custom programming
in the Ada language to object-oriented programming in
the C++ language. This allowed us to attempt more complex
algorithms and reuse existing component libraries.

The Building of PA_Route

The PA_Route system is composed of many components,
including a netlist reader, an artwork reader (which models
obstacles), a global router, a channel scheduler, and a
detailed router. A viewer is used to examine intermediate
and final results. Eventually the artwork is produced and
then verified. Even though the design time of the PA 8000
is long compared to most ASIC chips, we did not have time
to rewrite the whole system, so only three main parts were
designed and implemented from scratch: the main database,
the global router's over-the-block grid model, and the new
over-the-block detailed router. We leveraged the rest of the
system from the old HARP system with modifications to
interface with the new database.

Global Routing: A Block-Level Problem

For a given level within a chip hierarchy, the routing plane is generally 
occupied by a number of blocks with ports that need to be connected by 
physical wires on nets. The blocks are usually restricted to be rectilinear 
in shape but are allowed to vary in size. The space between the blocks 
is generally reserved for routing and is usually subdivided into adjacent 
routing regions or channels so that the routing problem can be solved 
with a divide-and-conquer approach. Within each region, a certain 
number of layers are reserved for routing the signals at the given level. 
Global routing is the first step in the routing process. Its job is to generate 
a routing plan in which each signal is assigned to a number of routing 
regions. The objectives for the global router are to achieve 100% assign- 
ment of signals to available routing regions, to minimize the overall chip 
size, and to ensure that the timing requirements of the signals are met. 

The global routing problem is generally represented by a global routing 
graph, which depicts the relationships between routing regions and ports 
to be connected. The edges of a global routing graph represent the rout- 
ing regions and the nodes represent the intersections between regions 
and the ports. The edges are assigned weights, which can be the dis- 
tance between two region intersections, the distance between a port 
and the nearest region, the cost for using a particular layer in the region, 
a penalty for switching routing layers, or a penalty for overflowing a 
region. Global routing is accomplished by implementing a lowest-cost 
path-finding algorithm on the global routing graph. 

Fig. 1. Global routing graph. Regions are indicated by dashed outlines.
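
As a rough illustration of lowest-cost path finding on a global routing graph, the sketch below runs Dijkstra's algorithm over a small weighted graph. Dijkstra is one common choice and is not necessarily the algorithm used in HARP or PA_Route. The node names, the edge weights, and the helper `undirected` are all invented for the example; in a real router the weights would combine distances, layer costs, and overflow penalties as described above.

```python
import heapq
from typing import Dict, List, Tuple

Graph = Dict[str, List[Tuple[str, float]]]


def lowest_cost_path(graph: Graph, source: str,
                     target: str) -> Tuple[float, List[str]]:
    """Dijkstra's algorithm. Returns (cost, node list), or (inf, [])
    when the target cannot be reached."""
    best = {source: 0.0}
    prev: Dict[str, str] = {}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            path = [node]
            while path[-1] != source:       # walk predecessors back to source
                path.append(prev[path[-1]])
            return cost, path[::-1]
        if cost > best.get(node, float("inf")):
            continue                        # stale heap entry
        for nbr, weight in graph.get(node, []):
            new_cost = cost + weight
            if new_cost < best.get(nbr, float("inf")):
                best[nbr] = new_cost
                prev[nbr] = node
                heapq.heappush(heap, (new_cost, nbr))
    return float("inf"), []


def undirected(edges) -> Graph:
    """Build an adjacency list in which every edge can be used both ways."""
    graph: Graph = {}
    for a, b, w in edges:
        graph.setdefault(a, []).append((b, w))
        graph.setdefault(b, []).append((a, w))
    return graph


# Hypothetical graph: x1..x3 are region intersections, portA/portB are ports.
routing_graph = undirected([
    ("portA", "x1", 2.0),
    ("x1", "x2", 5.0),    # direct channel, made expensive by an overflow penalty
    ("x1", "x3", 1.0),
    ("x3", "x2", 1.5),
    ("x2", "portB", 2.0),
])
cost, path = lowest_cost_path(routing_graph, "portA", "portB")
print(cost, path)
```

In this toy graph the penalty on the direct channel makes the detour through `x3` cheaper, which is the kind of trade-off the edge weights are meant to express.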



To minimize development time, we partitioned our development
team into two parallel groups. One group started on
the new database and started porting the old programs,
while the other started implementing some of the new global
router features in the old system. When most of the old
HARP system was ported, we ported the modified global
router. This meant that the global router stayed in structured
custom Ada code, which was the language used in HARP.

Database Changes 

The capability needed by the PA 8000 design to route over
the blocks required us to improve the expressiveness of the
underlying database models used in routing regions. The new
database allows us to model the obstacles and internal ports
that we see in these over-the-block regions. The advanced
port and net models (foliage) we implemented also required
significant changes to the database. This not only allows
us greater control and flexibility in routing, but also allows
us to separate the act of global routing from the channel
scheduler that calls the detailed router.

We developed automatic code generation technology to 
transform a graphical model of the database into code. The 
code generation technology was extended to support the 
C++ language and we began to work on the new input and 
output programs. With the database changes completed, 
we could begin porting the old HARP programs to the new 
database. 

Global Routing 

The general global routing problem is described in the sidebar above.
PA_Route incorporates a global router that understands
rectangular blocks. The global router needed to be extended
to support L-shaped blocks for the PA 8000. An L-shaped
block is cut either horizontally or vertically into two
rectangular components and special control is imposed
on the channel between the cut components to keep the
components linked together. The routing plane is divided
into rectangular routing regions that meet only at T-intersections
such that only two sides of a routing region have
constrained ports. This somewhat restricted routing model
comes from a conscious decision to avoid situations in which
a routing region becomes a "switchbox" with ports on all four
sides constrained to fixed locations. The more constrained
switchbox routing problem generally requires more run
time, creates more constraint cycles, demands clever rip-up
and reroute strategies, and tends to leave more shorts for
manual repair. We opted instead to concentrate our effort on
providing more flexibility in the PA_Route global router for
meeting user requirements and for achieving the smallest
possible overall chip area.
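
The cut of an L-shaped block into two rectangles, mentioned above, can be illustrated with a little geometry. This sketch only shows the cut itself. It assumes the L shape is given as a bounding rectangle minus a rectangular notch in one corner, and it does not model the special control PA_Route imposes on the channel between the two pieces. The `Rect` type and `cut_l_shape` function are hypothetical names.

```python
from typing import NamedTuple, Tuple


class Rect(NamedTuple):
    x0: int
    y0: int
    x1: int
    y1: int


def _transpose(r: Rect) -> Rect:
    """Swap the x and y axes of a rectangle."""
    return Rect(r.y0, r.x0, r.y1, r.x1)


def cut_l_shape(outer: Rect, notch: Rect,
                horizontal: bool = True) -> Tuple[Rect, Rect]:
    """Split an L shape (outer minus a corner notch) into two rectangles.

    A horizontal cut yields a full-width slab plus a narrower arm beside
    the notch; a vertical cut is the same operation with the axes swapped.
    """
    if not horizontal:
        a, b = cut_l_shape(_transpose(outer), _transpose(notch), True)
        return _transpose(a), _transpose(b)
    # x-extent of the arm that sits next to the notch
    if notch.x0 == outer.x0:                 # notch on the left side
        arm_x0, arm_x1 = notch.x1, outer.x1
    else:                                    # notch on the right side
        arm_x0, arm_x1 = outer.x0, notch.x0
    if notch.y1 == outer.y1:                 # notch in a top corner
        full = Rect(outer.x0, outer.y0, outer.x1, notch.y0)
        arm = Rect(arm_x0, notch.y0, arm_x1, outer.y1)
    else:                                    # notch in a bottom corner
        full = Rect(outer.x0, notch.y1, outer.x1, outer.y1)
        arm = Rect(arm_x0, outer.y0, arm_x1, notch.y1)
    return full, arm


outer = Rect(0, 0, 10, 8)
notch = Rect(6, 5, 10, 8)                    # top-right corner removed
h_cut = cut_l_shape(outer, notch, horizontal=True)
v_cut = cut_l_shape(outer, notch, horizontal=False)
print(h_cut)
print(v_cut)
```

Either cut covers the same area; which one is preferable would depend on where the router wants the artificial channel between the two components to lie.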

The old HARP global router, like any traditional global router, 
assumes that blocks are black boxes, that the points for 
connections are on the edges of the black boxes, and that 
routing is confined to the channel areas between the blocks. 
That is, all routing resources inside the blocks are dedicated 
to the blocks' internal implementation only and therefore 
routing at the global level is not allowed to traverse through 
the blocks. This simplistic assumption was largely accurate 
in the days of two-layer and three-layer IC processes. With 
the advent of the HP CMOS14 process, which can have up 
to five routing layers, the assumption that routing resources 
inside a block are dedicated only to the block is no longer 
realistic. For the PA 8000, a good amount of metal 3 and 
metal 4 resources inside some child blocks are available for 
routing global nets. 

Being able to use such over-the-block routing resources can
lead to reduced signal timing, decreased channel congestion,
and ultimately smaller overall routed chip size. Having judged
that over-the-block routing was a critical factor to the success
of the PA 8000, the PA_Route team undertook a revolutionary
change in the global router to support the routing of global
nets over any block, provided that there are routing resources
available over the block. The traditional global routing graph
was augmented with a virtual grid model over each child
block, a sophisticated net flow optimizer, and an efficient
routing resource estimator. The grid model allows the lowest-cost
path of a global net to traverse through any region over
a block as long as there are free routing resources. The global
router builds a detailed model of routing resources in each
region (channel or block) and tracks free spaces in the regions
based on a sophisticated density estimator that understands
obstacles. The net flow optimizer minimizes jogging
and distributes unavoidable jogs of different nets to different
regions to reduce congestion. For connecting to the new
solder bump I/O pads, which are inside some child blocks,
the new global router was extended to support ports inside
any block, with the restrictions that the ports be on selected
port layers and that there be available routing resources in
the block. The global router takes care of avoiding obstacles
and ports on the edges of a block when inside ports are
brought out of a block to form a lowest-cost path. The
net flow optimizer also plays an important role in choosing
an optimal exit point for the inside port so as to reduce
unnecessary jogging.

The predetermined solder bump I/O pad locations for the
PA 8000 force the placement to be unperturbed during routing.
This is a hard problem for the global router. Because the
global router does not have the luxury of actually routing the
nets, utilization of routing resources over the block as well as
in the channel regions is controlled using a close estimate
of detailed routing. A reasonably accurate and fast density
estimator was incorporated into the PA_Route global router.
Since routing over the block is allowed, the density estimator
must understand prerouting and obstacles. A density check
phase was introduced after the evaluation of the lowest-cost
path of a net. If the path would exceed the routing capacity
of one or more regions, the grid model in the global routing
graph is modified to forbid further routing through the congested
portion of the regions and an alternate lowest-cost
path is sought. This checking process is repeated until either
a clear path is found or no clear path is found, in which case
the net is left unrouted to preserve the fixed placement.
In addition to performing accurate density calculations, the
global router also attempts to achieve minimal placement
perturbation by automatically assigning nets to the less-dense
layer wherever there is more than one routing layer
available, and by optimizing net flow at region interfaces to
reduce routing congestion.
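
The density check loop described above can be sketched as follows. This is a simplified illustration under several assumptions, not the PA_Route implementation: each region has a single integer track capacity, a breadth-first search stands in for the lowest-cost path search, and a congested region is forbidden as a whole rather than only its congested portion. All names and the example data are hypothetical.

```python
from collections import deque


def find_path(adj, src, dst, forbidden):
    """Breadth-first search for a path that avoids forbidden regions."""
    if src in forbidden or dst in forbidden:
        return None
    prev = {src: None}
    queue = deque([src])
    while queue:
        region = queue.popleft()
        if region == dst:
            path = []
            while region is not None:
                path.append(region)
                region = prev[region]
            return path[::-1]
        for nbr in adj[region]:
            if nbr not in prev and nbr not in forbidden:
                prev[nbr] = region
                queue.append(nbr)
    return None


def route_with_density_check(adj, capacity, nets):
    """Route nets one at a time; after each path search, check density and
    retry with congested regions forbidden. Nets with no clear path are
    left unrouted so that region capacities are never exceeded."""
    used = {region: 0 for region in capacity}
    routed, unrouted = {}, []
    for name, (src, dst) in nets.items():
        forbidden = set()
        while True:
            path = find_path(adj, src, dst, forbidden)
            if path is None:                # no clear path: leave unrouted
                unrouted.append(name)
                break
            full = [r for r in path if used[r] >= capacity[r]]
            if not full:                    # density check passed
                for r in path:
                    used[r] += 1
                routed[name] = path
                break
            forbidden.update(full)          # forbid congested regions, retry
    return routed, unrouted


# Hypothetical example: two single-track channels between regions L and R.
adj = {"L": ["top", "bottom"], "top": ["L", "R"],
       "bottom": ["L", "R"], "R": ["top", "bottom"]}
capacity = {"L": 3, "R": 3, "top": 1, "bottom": 1}
nets = {"n1": ("L", "R"), "n2": ("L", "R"), "n3": ("L", "R")}
routed, unrouted = route_with_density_check(adj, capacity, nets)
print(routed, unrouted)
```

The third net is left unrouted because both channels are full, mirroring the choice to leave a net for manual attention rather than grow a channel and disturb the fixed placement.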

For multipin nets, for which connectivity can have significant
performance and density implications, port foliage was added
to the PA_Route database to give the global router a model
for determining port equivalency, while net foliage was introduced
to allow the global router to generate more sophisticated
physical connectivity for the shortest path. This combination
of port and net foliage results in a high degree of
control over the physical connectivity of a net. A designer
can specify foliage explicitly or allow the PA_Route global
router the freedom to optimize the global route by creating
foliage as necessary to minimize the total wire length and to
avoid congestion.

Over-the-Block Routing 

To handle two important aspects of the PA 8000, a new over-the-block
detailed router was required. The router had to
handle obstacles in any routing layer and it had to be able to
connect to ports located anywhere within the routing region.



Detailed Routing Methods 

Detailed routing has generally evolved out of four basic approaches:
maze routing, line probe routing, left-edge routing, and greedy channel
scanning. The problem is formulated as a routing area containing connection
points or pins on a rectilinear (usually rectangular) region or
channel. Pins can be located on any of the four sides of the region or
within the region. The connection points are generally constrained to
reside in certain layers to make them easier to connect to.

Even single-layer routing problems are NP-complete, which in practice means that
an optimal solution cannot be found in a reasonable time. For this
reason, detailed routing solutions are heuristic in nature. The factors in
determining a solution's usability are the number of terminals, net width,
via restrictions, boundary shape, number of layers, and net types such as
power, ground, and clock wires.

Maze routers abstract the channel routing problem with a grid-based
model. Wires are restricted to follow paths along the grid lines.
Routing is accomplished by laying down wires on the grid one at a time.
Obstacles are modeled as disallowed portions of the grid. Therefore,
maze routing can handle arbitrary obstacles.
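
A minimal sketch of grid-based maze routing is shown below, using a breadth-first wave expansion in the style of Lee's algorithm. It handles a single layer and a single two-pin wire; the grid encoding (`#` for an obstacle, `.` for a free cell) and the function name `maze_route` are assumptions made for the example.

```python
from collections import deque


def maze_route(grid, start, goal):
    """Find a shortest path of free grid cells from start to goal.

    grid is a list of equal-length strings; '#' marks an obstacle.
    Returns a list of (row, col) cells, or None if no path exists.
    """
    rows, cols = len(grid), len(grid[0])
    prev = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:         # backtrace from goal to start
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != "#" and (nr, nc) not in prev):
                prev[(nr, nc)] = cell
                queue.append((nr, nc))
    return None


grid = [
    "....",
    ".##.",
    "....",
]
path = maze_route(grid, (1, 0), (1, 3))
print(path)
```

To lay down wires one at a time, the cells of each completed path would be marked as obstacles before the next wire is routed.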

Line probe routers scan in the x and y directions searching for line segments
from either the source or the destination. Scan lines do not project
beyond obstacles, so obstacles are avoided by a subsequent probe of the
line segments orthogonal to the ones from the previous pass.

Left-edge routers sort wires by the boundary formed by the leftmost and
rightmost pins. A left-edge router orders wires one at a time using a greedy method that
places segments into tracks. It fills tracks one at a time, packing segments
to minimize unused space in a track. The route is complete when
all wires have been assigned to a track.
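
The track-filling step of a left-edge router can be sketched as below. This is the basic textbook form, shown under simplifying assumptions: each wire is reduced to a horizontal span between its leftmost and rightmost pins, and vertical constraints between pins in the same column are ignored. The net names and spans are invented.

```python
def left_edge_assign(spans):
    """Assign horizontal spans to tracks with the left-edge method.

    spans maps a net name to (left, right) column positions.
    Returns a list of tracks, each a list of net names, filled one
    track at a time in order of increasing left edge.
    """
    remaining = sorted(spans.items(), key=lambda item: (item[1][0], item[1][1]))
    tracks = []
    while remaining:
        track, leftover = [], []
        last_right = float("-inf")
        for net, (left, right) in remaining:
            if left > last_right:           # fits after the previous segment
                track.append(net)
                last_right = right
            else:                           # overlaps; try a later track
                leftover.append((net, (left, right)))
        tracks.append(track)
        remaining = leftover
    return tracks


spans = {"a": (0, 3), "b": (1, 5), "c": (4, 7), "d": (6, 9), "e": (8, 10)}
tracks = left_edge_assign(spans)
print(tracks)
```

With these spans, the first track takes `a`, `c`, and `e` and the second takes `b` and `d`, so five wires fit in two tracks.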

Greedy channel routers divide the channel into horizontal tracks and
vertical columns. This approach works on one vertical column at a time,
scanning from left to right. The approach is termed "greedy" because
each column is optimized individually, although the entire channel is not
guaranteed to be optimal. The greedy router sweeps from column to
column, trying to join segments of nets assigned to multiple tracks. The
greedy channel scan is capable of providing fast solutions but cannot be
easily extended to handle arbitrary obstacles.

The over-the-block detailed router used in PA_Route uses a completely
different approach based on a graph. The graph represents horizontal
and vertical constraints of the wires.



Restrictions on the topology of the obstacles and ports were
negotiated with the PA 8000 design team to relax the constraints
of the over-the-block router to meet an aggressive
schedule.

Child blocks were constructed by lower-level composition
teams. Their design used the lower layers of metal to perform
local interconnect and the upper levels of metal for
intermediate levels of interconnect. The result was that partially
used layers were made available to the over-the-block router
to complete the intermediate and global level interconnect.
The over-the-block router was given the responsibility of
avoiding artwork created by the lower-level composition
teams.
Fig. 2. To the over-the-block detailed router, each wiring
trunk is a node in a graph (a). The edges in the graph model
the vertical pin constraints of each trunk and the horizontal
constraints of that trunk's placement relative to other trunks.
(b) Routing plan. Shades of gray represent different metal layers.



A review of existing detailed routing algorithms is presented
on page 43. The PA_Route over-the-block router is built on a
new routing algorithm. It is based on a channel-like paradigm
although it handles obstacles with arbitrary configurations.
Layers are generally assumed to run in either the x direction
or the y direction. Wires are routed assuming one direction
is preferred, that is, in the preferred direction the wire runs
for a longer distance and carries most of the current between
a source and its sinks. By handling obstacles in arbitrary
configurations, the over-the-block router extends channel
routing concepts into an area-based routing regime. It
retains many of the benefits of channel routing while being
flexible enough to handle more complex routing topologies.
This type of routing methodology will become more common
as more layers are made available for routing. The
over-the-block router can connect to ports not only on the sides of
routing regions, but also in the middle of a routing region.
The over-the-block router supports variable wire width and
spacing, which gives the designers greater control over the
timing delays of a signal. The over-the-block router reads the
layout rules directly and does not abstract them into arbitrary
routing constraints. Unlike other algorithms, our proprietary
over-the-block algorithm does not require wires to be
"binned" according to their width and spacing, and it does
not rely on a compaction process to achieve optimal density.

The over-the-block router contains the features needed for 
high-speed, performance-driven critical designs such as the 
PA 8000. It handles complex via structures necessary for 
high-performance designs by allowing the intersection area 
to be expanded beyond the size dictated by individual metal 
connections. While it supports a high degree of manual con- 
trol, the over-the-block router is also reasonably fast, making 
multiple design turnarounds feasible. 

The over-the-block router models the routing problem as
two-dimensional line segments that represent the largest wiring
component of a net. This trunk is assumed to cause most of
the parasitic delay, and the overall goal of the algorithm is to
find an optimal ordering of these trunks to generate a dense
packing and avoid obstacles. Each trunk and each obstacle
becomes a node in a graph. The edges in the graph model
the vertical pin constraints of each wire and the horizontal
constraints of that trunk's placement relative to other trunks
(see Fig. 2). The total graph contains the weighted
constraints of all trunks in the routing region. Thus, each trunk is
considered for placement during each phase of edge
direction assignment, and the net ordering difficulties of other
routing schemes are avoided. The general nature of the edge
selection allows other constraints such as cross talk and
delay to be modeled in future versions.
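
The idea of ordering trunks under graph constraints can be illustrated with a minimal sketch. It is not the proprietary PA_Route algorithm, whose edge weighting and edge direction assignment are not described here. It assumes only a classic vertical constraint graph in which each edge says that one trunk must lie above another, and it orders the trunks with a topological sort. The function name `order_trunks` and the input format are hypothetical.

```python
from collections import defaultdict, deque


def order_trunks(trunks, above):
    """Topologically order trunks under 'must lie above' constraints.

    trunks is an iterable of trunk names; above is a list of (a, b)
    pairs meaning trunk a must be placed above trunk b. Returns an
    order from top to bottom, or raises ValueError when the constraints
    are cyclic (a real router would have to alter the wire topology).
    """
    trunks = list(trunks)
    successors = defaultdict(list)
    indegree = {t: 0 for t in trunks}
    for a, b in above:
        successors[a].append(b)
        indegree[b] += 1
    # Trunks with no trunk required above them can be placed first.
    ready = deque(t for t in trunks if indegree[t] == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for s in successors[t]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    if len(order) != len(trunks):
        raise ValueError("cyclic constraints: topology must be altered")
    return order
```

A cycle in such a graph means that no simple top-to-bottom ordering satisfies every constraint, which is the situation in which a router has to change the shape of a wire.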

The algorithm can handle any number of layers and is not 
rigidly required to follow layer-per-direction constraints for 
vertical components (i.e., connections to ports) or trunk 
components. When constraints occur, the over-the-block 
router tries several schemes to alter the topology of the wire, 
such as removing the constraints. The schemes include jog
insertion and wrong-side segmenting in various forms.

If the over-the-block router cannot complete a route, it
produces a spacing violation or short circuit along with
associated diagnostics and completes the route. When this occurs,
the user has the option of fixing the short manually or
altering the routing problem for the region by such methods as
growing the placement, constraining the global routing with
capacity controls, or other means.

Although specialized to handle the routing problems of the 
PA 8000, the over-the-block router was built to handle the 
general channel routing problem. No shortcuts were taken 
that would compromise robustness for the general case in 
the expectation that the router could be leveraged for other 
designs. 




Fig. 3. PA 8000 CPU chip with highlighted areas showing 
where PA_Route performed block-level routing. 
Conclusion

A new block routing system called PA_Route was built
specifically to address the needs of high-performance,
leading-edge IC designs. PA_Route contains significant features
built on new technology while leveraging existing code to
minimize risk. It was designed to be extendable to address
future issues as they arise. It was used successfully to route
the PA 8000 chip and did not impact its schedule despite the
high levels of risk involved. Fig. 3 shows the areas of the
PA 8000 chip where PA_Route performed block-level routing.
The features and limitations of the system were carefully
designed with close cooperation between the PA 8000 design
team and the CAD development team. Many alternatives
were analyzed using design-critical issues as the
measurement criteria. Balancing immediate and future chip design
needs was given high importance in the design of PA_Route
so that the system can continue to be used for future designs.
Intelligent Networks and the HP
OpenCall Technology 

The HP OpenCall product family is a portfolio of computer-based 
telecommunications platforms designed to offer a foundation for 
advanced network services based on intelligent network concepts. This 
article concentrates on the HP OpenCall service execution platform, 
service management platform, and service creation environment. 

by Tarek Dehni, John O'Connell, and Nicolas Raguideau



Intelligent networks are an expanding area within the
telecommunications industry. The adoption of intelligent
network technology has been driven by its ability to allow
telecommunications network operators to install and provision
new, revenue-generating communication services in their
networks. With these services installed within the network,
the extra functionality they provide can easily and
instantaneously be made available to the whole customer base.
Examples of such services are the freephone services (the
cost of the telephone call is paid by the called party), credit
card calling, and the CLASS services (custom local area
signaling services) in North America.

At the same time, the standardization of some key interfaces
within the telecommunications network has allowed greater
competition between network equipment providers, offering
the possibility of genuinely multivendor networks. The
opening up of previously proprietary switch interfaces has
made it easier for network operators to add new functionality
to their networks, since this functionality can now be
implemented outside the switch, often on industry-standard
computer platforms. Today, with the emergence of new fixed
and mobile network operators in many areas of the world,
two new drivers for intelligent networks have emerged.
Firstly, there is the need for interoperability between these
networks. Secondly, operators seek to differentiate themselves
on their service offerings. Both imply an even stronger
requirement to support extra intelligence in the network. This
will ensure the continued demand for more open and
flexible intelligent network solutions.

Hewlett-Packard's product strategy for the intelligent
network market is based on the HP OpenCall product family, a
portfolio of computer-based telecommunications platforms
designed to offer a solid foundation for competitive,
revenue-generating services based on intelligent network
architectures. This article concentrates on the HP OpenCall service
execution platform, service management platform, and
service creation environment, with particular emphasis
on the architecture and design of the service execution
platform. The HP OpenCall SS7 platform is described in the
article on page 58.



In this paper, we introduce the key concepts in intelligent
networks including the role of standardization, we explore
the system requirements for a class of intelligent network
elements (those elements targeted by the HP OpenCall
platforms), and we highlight the key aspects of the design of the
HP OpenCall platforms.

Intelligent Networks 

The telephony service is very simple. Its objective is the
transport of speech information over a distance in real time.
Telephony networks were originally designed with the
assumption that the same service would be offered to all
users, and this held true for a long time. Users could select
one of a range of destinations and be called by other users.
Over the years the focus of telephony service providers has
been to improve the technology to offer these basic services
to a larger number of customers, and over longer and longer
distances. At the same time, terminals have become mobile,
with mobile phone users demanding the same levels of
services.

As a consequence of this evolution, today's telephony net- 
works consist of a mix of more or less integrated technolo- 
gies and networks that have been deployed over more than 
:f() years, forming a very complex and large-scale global 
infrastructure. 

In this context, the task of provisioning a new service in an 
operator's network is extremely complex and may vary con- 
siderably depending on the network infrastructure. The con- 
ventional method of continually integrating these new func- 
tions into public exchanges is costly and lacks flexibility, 
making it difficult for network operators to compete effec- 
tively in an increasingly competitive environment. 

This situation has led network operators and their suppliers
to look for a better approach, in which control functions and
data management linked to the creation, deployment, and
modification of services can evolve separately from the basic
switching or existing functions of an exchange. They
assigned standards organizations (ITU-T, ETSI, Bellcore) the
responsibility of defining an architectural framework for the
creation, execution, and management of network services.
Service Plane
• Extraction of service features to clarify service requirements
• S = Service
• SF = Service Feature

Global Functional Plane
• Service execution description making use of SIBs and BCP
• BCP = Basic Call Process
• SIB = Service Independent Building Block
• POI = Point of Initiation
• POR = Point of Return

Distributed Functional Plane
• Basic Call State Model (BCSM) models basic call process
• Description of service execution
• FE = Functional Entity

Physical Plane
• Physical implementation scenarios for intelligent network functionality
• Define protocol INAP (Intelligent Network Application Protocol)
• PE = Physical Entity

Fig. 1. The Intelligent Network Conceptual Model of the ITU-T is a
framework for describing and specifying intelligent network systems.

Intelligent Network Conceptual Model 

The ITU-T (International Telecommunications Union —
Telecommunications Standardization Sector) developed the
Intelligent Network Conceptual Model to provide the
framework for the description of intelligent network concepts and
their relations. The Intelligent Network Conceptual Model
consists of four planes, each of which is a different
abstraction of the telecommunications network. The ITU-T also
planned the specification of the target intelligent network
architecture through several study periods, thereby enabling
incremental implementations. These successive standardized
functions are referred to as intelligent network capability
sets (see Fig. 1).

Service Plane. The service plane describes services and the
service features as seen from a user perspective. A service
feature is the smallest part of a service that can be
perceived by a user. The service plane does not consider how
the service is implemented or provisioned in the network.

Global Functional Plane. The global functional plane
describes the design of services as a combination of service
independent building blocks. Service independent building
blocks give a model of the network as a single entity, that is,
there is no consideration of how the functionality is
distributed over the network.

A specific service independent building block is the basic
call process, which corresponds to the basic call service. It
has points of initiation and points of return. An instance of
a service logic can be called from a point of initiation, and
after execution of the service logic, the basic call process
is recalled in a point of return. Service logic corresponds to
services or service features in the service plane.

Distributed Functional Plane. The distributed functional
plane (Fig. 2) gives a distributed functional view of the
network. Functional entities are groupings of functionality
that are entirely contained in a physical entity. In other
words, they cannot be split among several physical entities.
The distributed functional plane describes the functional
entities together with their relationships.

The identified functional entities are as follows: 

• The call control access function models the interface with
the end-user terminal.

• The call control function provides call and connection
control, that is, basic call processing.

• The service switching function models the call control
function as seen from the service control function.

• The service control function provides the logic and
processing capabilities for intelligent network-provided services. It
interacts with the service switching function to modify the
behavior of the basic call, and with the two entities below
to access additional logic or obtain service or user data.

• The service data function contains customer and network
data for real-time access from the service control function.

• The specialized resource function provides additional
specialized resources required for intelligent network-provided
services, such as dual-tone multifrequency (DTMF)
receivers, announcements, conference bridges, and so on.

Finally, on the management side, three functional entities
are defined:

• The service management function allows for deployment
and provisioning of intelligent network services and for
support of ongoing operations. Its management domain can
cover billing and statistics as well as service data.

• The service creation environment function allows new
services to be safely and rapidly defined, implemented, and
tested before deployment.

• The service management access function provides the
interface between service managers and the service management
function.

It is envisioned that the service independent building blocks
specified in the global functional plane will be realized in the
distributed functional plane by a sequence of coordinated
actions to be performed by various functional entities.

Physical Plane. The physical plane describes the physical
alternatives for the implementation of an intelligent network.
The identified possible physical nodes include service control
points, switches, and intelligent peripherals.



Functional Entities:
SMAF  Service Management Access Function
SCEF  Service Creation Environment Function
SMF   Service Management Function
SDF   Service Data Function
SCF   Service Control Function
SRF   Specialized Resource Function
SSF   Service Switching Function
CCF   Call Control Function
CCAF  Call Control Access Function

Signaling Relationship
Bearer Relationship

Fig. 2. Intelligent network distributed functional plane, showing
functional entities.
The information flows between functional entities contained
in separate physical entities imply a need to specify and to
standardize the interfaces and protocols between these
separate physical entities.

The following protocols have been defined by the ITU-T:

• The ISDN User Part (ISUP) and the Telephony User Part
(TUP) are instantiations of the information flows between
call control functions.

• The Intelligent Network Application Protocol (INAP) covers
the information flows between service switching functions
with a series of message sets.

• The information flow between the service control
function and the service data function is based on the
X.500 specifications.

Except for the TUP, these protocols are network services on
top of the telephone companies' Signaling System #7 (SS7)
signaling networks.

Intelligent Network Rollout

The general architectural concepts of intelligent networks
are applicable to a wide range of telecommunications
networks including plain old telephony services (POTS)
networks, mobile communication networks (GSM, PCN, DECT),
ISDN, and future broadband networks. Furthermore, these
well-defined architectural principles can also be found in
standards from other organizations that have defined
equivalent or domain-specific architectures. Bellcore's AIN
(Advanced Intelligent Network) architecture shares many
features of the ITU-T approach, while ETSI, for example, has
identified various physical nodes (HLR, VLR, MSC)
communicating via the MAP protocol in mobile networks.

Although it didn't deliver all of its original promises, the
intelligent network concept is considered a success. Today
freephone remains the major revenue-earning service for
intelligent networks, and it is continuing to grow. Freephone,
split-charge, and premium rate services still generate almost
rid'!., of intelligent network revenue. Another service that is
providing significant revenue to network operators is virtual
private network service, which is currently experiencing the
most rapid growth.

It is interesting that all of these services share the
characteristic that they require data to be available throughout a
network. This class of service clearly promotes a centralized
intelligent network solution. In fact, the fundamental
success of the intelligent network concept is that it has
simplified data management of those services for which it has
succeeded (although service management remains a
relatively minor consideration in standards organizations). The
full potential of other types of intelligent network services
still needs to be realized.

Intelligent Network Element Requirements 

Hewlett-Packard has developed the HP OpenCall service
execution platform as an open, programmable, scalable,
highly available, and easily manageable platform that can be
used as a basis for implementing a range of different
elements of an intelligent network. The platform provides a set
of basic functionalities that are common to many intelligent
network elements. By installing suitable user-defined
applications or services on it, the HP OpenCall platform can be
extended to provide the service control function, service
data function, specialized resource function, and other
functionality to other nodes of the SS7 network, such as switches.
Thus, the aim of the HP OpenCall service execution platform
is to provide a platform that can be easily extended to meet
the requirements of different intelligent network elements.

The common requirements of these different network
elements are summarized in the following paragraphs.

Openness and Flexibility. One of the key goals of the
intelligent network is to promote multivendor solutions to allow
technology from different equipment providers to interwork.
In contrast, many of the early intelligent network solutions
were implemented on proprietary platforms. These
applications are often not portable across platforms, so customers
are often tied to a single equipment provider and cannot
always benefit from the latest advances in hardware.

Furthermore, intelligent networks are seen as evolutions of
existing networks, that is, new network elements,
implementing intelligent network functionality, are expected to
interwork with existing equipment. This implies that the
new elements must support multiple entry points to ensure
easy integration with other products, both at the SS7 network
interface and at the interface with the operations support
systems that manage the telephone network.

This need for multivendor solutions also drives the
standardization activity in intelligent networks (see
"Standardization — A Phased Approach" on page 48). In theory, if the
interfaces between network elements are standardized and
clearly defined, interworking should be easy. However,
despite the existence of multiple standards, local differences
make it necessary for a platform to adapt to many different
environments in terms of connectivity, protocol support, and
management. Furthermore, there is no standard environment
in the telecommunications central office. Each central office
often has its own collection of legacy systems. This implies
a need to be able to add network-specific, protocol-specific,
and environment-specific intelligence to meet customer
requirements.

Rapid Service Deployment. In the increasingly competitive
telecommunications market, network operators see the
need to differentiate themselves on their service offerings.
Operators want to be able to define, deploy, and customize
services quickly, while still guaranteeing the high levels of
quality and reliability that have traditionally been associated
with telecommunications networks. There is a need to
support all aspects of the service life cycle, including definition,
validation, installation, and management.

Performance and Determinism. The round-trip real-time budget
from a switch to a network element such as a service control
point is typically quoted as 250 milliseconds (mean) with
about 95% of responses within 500 ms. This includes the
switch processing time, the transmission and queuing times
of both the request and response messages, and the service
control point processing time (encoding and decoding of
messages, service activation and execution, query to
database). Clearly, the faster the transmission links and the
smaller the buffering in the system, the more time is
available for service control point processing. For a simple
freephone service, once the SS7 transmission times are
subtracted, we obtain a requirement for a mean service control
point processing time of 50 ms with 95% completing within
Standardization — A Phased Approach



Because of the complexity of intelligent networks, the number of
unresolved technical issues, and the significant financial investments, the
development of an intelligent network architecture supporting all
possible telecommunications services and technologies, called the target
intelligent network, will take many years. Standardization bodies have
chosen to adopt a phased approach to intelligent network development
that takes advantage of the technological capabilities at a given point in
time and that guarantees backward compatibility between the different
phases.

The International Telecommunications Union — Telecommunications
Standardization Sector (ITU-T) has addressed this phased approach in its
Recommendation Q.1211.

Table I shows the different phases in terms of capability sets and their
descriptions. Each capability set gives a set of definitions of capabilities
that are of direct use to both manufacturers and network operators.

Table I
Phased Approach to the Target Intelligent Network

Phase             ITU-T
(Capability Set)  Recommendation  Time Frame                   Description

CS1               Q.121x          Finalized in 1995            First standardized stage
CS2               Q.122x          To be finalized in 1997      CS1-compatible. Handling of multiparty calls
CS3               Q.123x          Work started at end of 1996  CS2-compatible. Handling of broadband aspects and integration with the TMN
CSn               ...             ...                          Evolving towards target intelligent network

TMN = Telecommunications Management Network

Capability Set 1 (CS1)

CS1 is the first standardized stage of intelligent network evolution based
upon the existing technology. It is a subset of the target intelligent
network architecture. CS1 defines the functional entities (see Fig. 2 on
page 47) and the interface between these entities. It also defines the
generic model of two-party call processing functionality, the Basic Call
State Model (BCSM). CS1 limits end-user access to service processing
capabilities to the following types: analog lines, ISDN basic and primary
rate interface (BRI/PRI), and analog and SS7 trunks.

The target set of services for CS1 includes universal personal
telecommunication (UPT), freephone, virtual private network (VPN), credit card
calling, user-defined routing, and others. All of these services are
considered immediately marketable and highly profitable. The common
characteristic of all CS1 services is that they apply only to one party of the call
(either the originating or the terminating party), and generally only during
the call setup phase.



The protocol used by the different CS1 functional entities to
communicate is called the Intelligent Network Application Protocol (INAP). This
protocol relies on existing underlying transport protocols (e.g., SS7/TCAP)
to convey the intelligent network application layer protocol messages.

Capability Set 2 (CS2)

CS2 is the second standardization stage and is a superset of the CS1
recommendations. CS2 aims to support enhanced services in addition
to the ones supported by CS1. It introduces new capabilities that allow
handling of multiple parties that are or will be involved in the same call,
such as conference calling. Other capabilities will be included in CS2 to
support personal mobility (UPT) and terminal mobility (DECT, GSM)
functionality. These new enhancements and capabilities are achievable by
extending the existing CS1 call processing model and functional model.
INAP operations are also extended and new ones are to be defined.
Standardization activities are going on at ITU-T and ETSI (European
Telecommunications Standards Institute). A complete revision of the CS2
protocol is expected at the end of 1997.

The target set of services for CS2 includes call completion to busy
subscriber, conference calling, call transfer, call waiting, mobility services
(UPT, GSM), and others. The common characteristic of all these services
is that they require call party handling functionality that is not supported
in CS1.

Future Capability Sets 

CS1 and CS2 do not cover all possible user accesses and network
capabilities. According to the phased approach, ITU-T plans to introduce CS3
(and maybe others later) to cover broadband network aspects (intelligent
network/B-ISDN integration), intelligent network/TMN integration, and
full support of mobile communications systems. Requirements are being
set up and specifications might come in 1998.

HP Approach 

Because the needs, in terms of network operation, vary from one network
operator to another (operator-specific charging and billing,
implementation limitations), and because the INAP standard will continue to evolve
following the different capability sets, network equipment providers have
to work with a large number of INAP variants.

To meet this need and to be able to respond to its customers'
requirements rapidly, HP has developed a flexible service execution platform
(see the accompanying article) that is able to rapidly follow the evolution
of the INAP and support different customers' specific variants. Tools are
provided to automate the support of a message set's syntax. The
implementation of the message set's semantics has been pushed to the
application level, leaving the platform itself independent of any supported
message set. This has the benefit that a single platform can be
maintained for a varied and evolving customer base. This independence from
the INAP message set also allows the HP platform to easily support
similar message sets defined by other standards bodies such as MAP
(defined by ETSI for mobile networks) and AIN 0.1 and 0.2 (defined by
Bellcore).



02 ms. The behavior must be controllable so that the system
is deterministic. Transaction rates of up to 10,000
transactions per second for network elements have been requested.

High Availability. Network elements such as service control
points and home location registers are critical components
in an intelligent network. Very high system availability is
required: no more than three minutes total downtime per
year, including both scheduled and unscheduled downtime.
This necessitates highly reliable hardware with no single
point of failure and software that allows mated applications
to back each other up in the event of a natural disaster
disabling one particular site.

Furthermore, in the event of a failure at a site, during the
failure recovery phase, the network element must be
responsive to other network elements, taking less than six
seconds to resume service. If not, the network considers
that a total failure has occurred at the site.

The availability requirements also pervade the scalability 
and functional evolution aspects. The system must be capa- 
ble of expansion and addition of new functionality without 
disruption of service. 

Similar kinds of requirements apply to service management 
systems, albeit not as severe as for network elements. Ser- 
vice management systems are typically allowed 30 minutes 
of total downtime per year. 
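
To put these downtime budgets in perspective, they can be converted into availability fractions. The short calculation below is an illustration only. It assumes a 365-day year, and the function name `availability` is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability(downtime_minutes_per_year):
    """Fraction of the year a system is up, given its yearly downtime."""
    return 1.0 - downtime_minutes_per_year / MINUTES_PER_YEAR


# Three minutes per year for network elements,
# thirty minutes per year for service management systems.
network_element = availability(3)
service_management = availability(30)

print(f"network element:    {network_element:.5%}")
print(f"service management: {service_management:.5%}")
```

Under these assumptions, three minutes of downtime per year corresponds to an availability of roughly 99.9994%, and 30 minutes per year to roughly 99.994%.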

Scalability. A network element must be scalable in terms of 
processing power, memory, persistent data storage, and 
communications without compromising system availability. 
On the hardware side, this means support for online up- 
grades of processor boards, memory boards, disk subsys- 
tems, communication controllers, and links. It must be 
possible to perform such operations without service
interruption. On the software side, it means bringing additional
resources into service smoothly and safely. If anything goes 
wrong, the software must automatically fall back to the last 
operational configuration. 

In general, network elements such as service control points 
and home location registers are classified in terms of trans- 
actions per second (TPS) and customer database size. Scal- 
ability then translates into the ability to add new hardware 
and/or software elements to increase either the maximum 
supported TPS rate or the maximum customer database size. 

Functional Evolution. The ability to add new applications or 
platform capabilities or simply upgrade existing ones with- 
out impacting the availability of the system is vital. This 
means that such things as updating the operating system, 
upgrading the application, adding a new communication 
protocol stack, or changing the firmware on a communica- 
tion controller must be achieved without disrupting real- 
time performance or system availability. This ability should 
not impose too many constraints on software development 
and installation. In all upgrade situations, a fallback must be 
possible. 

Manageability and Operation Requirements. There are detailed 
and rigorous requirements concerning installability, physical 
characteristics, safety, electromagnetic and electrical envi- 
ronments, maintenance, reliability, and so on. Remote man- 
agement interfaces to large and complex network systems 
with demanding performance and scalability requirements 
are needed. 

To allow easy management, complex distributed or replica- 
ted network elements must be capable of providing a single- 
system view to operations centers. At the same time, it is 
also very important to provide per-system views, to allow 
management of specific systems and to act upon them (typically
for fault management or performance management).
Newly installed network elements often need to be inte- 
grated into existing management systems. 



HP OpenCall Platforms 

HP OpenCall Service Execution Platform 

The HP OpenCall service execution platform is an open, 
scalable, programmable, highly available, and easily manage- 
able platform. It is implemented as a layer of functionality on 
top of the HP OpenCall SS7 platform (see article, page 58). 
Given the general network element requirements listed 
above, the following architectural principles and high-level 
design decisions were adopted when developing the HP 
OpenCall service execution platform. 

The software runs on the HP-UX* operating system, allowing
it to benefit immediately from advances in hardware speed 
and CPU processing power. 

All critical hardware and software components are repli- 
cated, giving the platform the ability to tolerate any single 
failure. An active/standby replication model was chosen at 
the software level, with an instance of the platform software 
running on two independent systems. Besides providing a 
high degree of fault tolerance, such an approach also pro- 
vides the basis for most online upgradability tasks, such as 
upgrading hardware and the operating system. 

Fig. 3 shows a typical site configuration, with two instances 
of the HP OpenCall service execution platform software 
executing on two independent HP 9000 machines, but with 
a single connection to the SS7 network. This site will appear 
as a single network node to other elements of the SS7 net- 
work. The configuration in Fig. 3 is a standard duplex con- 
figuration of the platform. Other configurations are possible, 
such as the mated-pair configuration, described later. In the
standard duplex configuration, the two machines are connected
via a dual LAN and both hosts are connected to a set
of signaling interface units. The HP OpenCall service execution
platform software runs on both hosts in active and
standby modes. The active host controls the signaling interface
units and responds to all incoming requests from the
SS7 network. Both hosts are capable of being active (i.e.,
there is no default active host).

Fig. 3. Duplex configuration of the HP OpenCall service execution
platform. Two instances of the platform execute on two independent
HP 9000 host machines with a single connection to the SS7
network.

The platform is network independent. It makes no assumption
about the structure of the SS7 network. It has the ability
to support multiple message sets, but all message set depen- 
dent decisions are made at the application level. 

Some customization is necessary before an instance of the 
HP OpenCall service execution platform software can be 
installed in or connected to an SS7 network. By default, the 
platform supports no message set, and it offers no service
to other network elements. Minimally, users of the platform 
must install one or more message sets and provide one or 
more applications to respond to requests coming from other 
network elements or to send requests to other network ele- 
ments. Furthermore, to ensure that the resulting solution 
can be monitored and managed, it should be integrated into 
an existing management system or a set of management 
tools should be implemented. 

A set of APIs (application programming interfaces) are pro- 
vided to access platform functionality, to be used either 
locally or remotely. This provides flexibility with respect 
to integration with operations support systems and legacy 
management systems. A set of management tools using 
these APIs are also provided, and can be used to provide 
a first level of platform management. Further levels of management
(for managing the platform, installed services, customer
data, etc.) can be provided by integrating the platform
with other external management systems via the provided 
APIs. 

Applications, or services, executing on the platform are interpreted.
New services or new versions of existing services can
be installed at run time without interrupting processing of
TCAP (Transaction Capabilities Application Part) traffic and
without affecting other running services. Because services
execute in a virtual machine with no direct access to the
operating system, services cannot affect the availability of
the platform. Furthermore, the service execution environ- 
ment can monitor service instances, ensuring that instances 
do not interfere and that resources are not permanently 
consumed. 

Services are independent of the operating system, protecting
them from changes and upgrades to the operating system. No
knowledge of the operating system is required to write the
service. Services are written in SLEL (Service Logic Execution
Language). Most of the basic concepts in SLEL have
been adapted from SDL (Specification and Description Language),
enhancing it with some features specific to intelligent
networks. This has the advantage that SDL is well
known in the telecommunications industry, and many
telecommunications standards are specified in SDL.

A replicated in-memory relational database is provided as
part of the service execution environment. The structure
and contents of this database are under the user's control.
By holding all customer-related data in RAM, services can
respect the real-time response time requirements imposed
by switches, since there is no disk access to retrieve
call-related data. To achieve data persistency, a copy of the
database is maintained on a standby host.

A Management Information Base (MIB) collects information 
on the state of the platform, making this information avail- 
able both via an API and via a set of management tools, and 
allows external applications to manage and configure the 
platform. All management operations are directed to the 
active system — the standby system replays all management 
commands — thus presenting a single-system view to external 
applications. 

The platform is implemented as a set of UNIX operating
system processes, allowing it to profit from multiprocessor 
hardware. 

HP OpenCall Service Creation Environment

The HP OpenCall service creation environment allows easy
definition, validation, and testing of services. Services are 
defined as finite-state machines, using a graphical language. 
The service creation environment provides a set of tools to 
allow the validation and simulation of services before they 
are deployed on the HP OpenCall service execution platform. 
The same service execution environment as described above 
exists in the service creation environment to ensure that the 
same behavior is observed when the service is installed in 
the HP OpenCall service execution platform. 

HP OpenCall Service Management Platform 

The HP OpenCall service management platform is capable 
of managing multiple HP OpenCall service execution platform
sites. Such a distributed configuration introduces an
extra degree of scalability, both in terms of transaction 
throughput capacity and database capacity. 

Platform Design 

This section discusses five key aspects of the HP OpenCall
service execution platform solution: the service execution 
environment, platform flexibility, high availability, database 
replication, and scalability. 

Service Execution Environment 

The HP OpenCall service execution platform provides an
execution environment for telecommunications services. 
These services are usually developed in response to new 
telecommunications requirements, and typically provide 
additional functionality to other elements in the SS7 
network. 

Programs defined in the Service Logic Execution Language
(SLEL) are interpreted by the platform at run time, and can
use language primitives to access the functionality of the
underlying platform. The primitives enable them to send and
receive TCAP messages, read and write message attributes,
access and update the in-memory database, communicate
over other external connections, send and receive events, 
set and reset timers, manage shared resources, log data to 
disk, and perform other functions. 

Programs written in SLEL define finite-state machines, that
is, a service waits in any given state until a recognizable
input stimulates a transition to the next state. During the
transition the service can carry out processing and output
messages. Fig. 4 summarizes the functionality of the SLEL
virtual machine.

Fig. 4. Services are written as finite-state machines in the Service
Logic Execution Language (SLEL) and run on the SLEL virtual
machine, which provides the functionality shown here.
TCAP = Transaction Capabilities Application Part
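
The finite-state machine model can be sketched in a few lines of
Python. This is not SLEL syntax, which is not reproduced in this
article; the state names, signal names, and the ServiceInstance class
are all hypothetical and serve only to show a service waiting in a
state until a recognized input triggers a transition with some
processing and output.

```python
from typing import Callable, Dict, List, Tuple

# An action runs during a transition and returns the messages to output.
Action = Callable[[dict], List[str]]


class ServiceInstance:
    def __init__(self, table: Dict[Tuple[str, str], Tuple[str, Action]], start: str):
        self.table = table  # (current state, input) -> (next state, action)
        self.state = start
        self.outputs: List[str] = []

    def feed(self, signal: str, payload: dict) -> None:
        """Handle one input; inputs not recognized in this state are ignored."""
        key = (self.state, signal)
        if key not in self.table:
            return
        next_state, action = self.table[key]
        self.outputs.extend(action(payload))  # processing and output messages
        self.state = next_state


def answer_query(payload: dict) -> List[str]:
    return [f"reply:{payload['dialed']}"]


def no_output(payload: dict) -> List[str]:
    return []


table = {
    ("idle", "query"): ("awaiting_end", answer_query),
    ("awaiting_end", "end"): ("idle", no_output),
}

service = ServiceInstance(table, start="idle")
service.feed("end", {})                          # not recognized in "idle"
service.feed("query", {"dialed": "8005550100"})  # triggers a transition
print(service.state, service.outputs)
```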

The following principles have been adopted in the development
of the HP OpenCall service execution environment:

• Services are not aware of the operating system. They are 
isolated from the HP-UX interface. This allows easy migration
between HP-UX versions. Furthermore, the application
developer only needs to provide service-specific logic and
need not be concerned with the startup, switchover, and
failure recovery aspects of the platform, since these are 
transparent to the service logic. 

• Services are not aware of replication. Replication of services 
and restart of services after a failure are handled automati- 
cally by the platform. No failure recovery code needs to be 
written by the service developer. The service programmer 
can assume a single-system view. To maintain this illusion, 
services can only access the local MIB and can only receive
locally generated events. Furthermore, the service execution
environment on the standby node is an exact replica of the
active node's, providing the same services, same resources,
same MIB structure, and so on.

• Real-time response to switches. The service execution 
environment gives the highest priority to the processing 
of TCAP messages and other service-related events (e.g.,
popped timers, received events). All other activities such as
database or MIB accesses by external applications are run as
background tasks. Service execution cannot be interrupted.
A single state transition runs to completion. All local access
by a service is synchronous (even to the database). A service
only blocks if it explicitly requests blocking, for example to
wait for the next TCAP message, wait for a timer, or wait for 
an event. 

• Services cannot crash the platform. The service execution 
environment presents a virtual machine as its upper inter- 
face. Services can only access the platform functionality 
from SLEL. No pointer manipulation is available. Resource 
allocation and deallocation are done automatically by the 
platform. There is no possibility for core dumps or memory 
leaks. 



• Online upgradability of services is possible. Because services
are interpreted, services can be enabled, disabled, installed,
or removed at run time without stopping the platform. Multiple
versions of a service can be installed simultaneously,
although only one can be enabled at any time. Instances of
the previously enabled version are allowed to run to completion,
so that TCAP traffic is not interrupted.

• The platform manages and monitors service instances. A
single service cannot block other services from executing.
There is a limit on the number of instructions that can be
executed before a service is terminated. This prevents infinite
loops and prevents one service from blocking out all
other services. Resources are automatically reclaimed once
the service instance has completed. There is also a limit on
the total lifetime of a service, as a way of garbage collecting
unwanted service instances. Both limits can be set on a
per-service basis, and can be altered at run time.
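
The instruction limit described in the last principle can be
illustrated with a short sketch. It assumes interpreted service logic
can be modeled as a stream of instructions; run_instance,
LimitExceeded, and the limits table are hypothetical names, and the
lifetime limit (which would compare a start timestamp against a
per-service maximum in the same way) is left out for brevity.

```python
import itertools


class LimitExceeded(Exception):
    """Raised when a service instance uses up its instruction budget."""


def run_instance(instructions, max_instructions: int) -> int:
    """Execute instructions one at a time, stopping at the budget.

    `instructions` is any iterable standing in for interpreted service
    logic. Returns the number of instructions executed.
    """
    executed = 0
    for _ in instructions:
        if executed >= max_instructions:
            raise LimitExceeded(f"terminated after {executed} instructions")
        executed += 1
    return executed


# Per-service limits, kept in a table so they can be changed at run time.
limits = {"well_behaved": 100, "runaway": 1000}

print(run_instance(range(5), limits["well_behaved"]))   # finishes normally
try:
    run_instance(itertools.count(), limits["runaway"])  # infinite loop
except LimitExceeded as exc:
    print(exc)
```

Because the budget is checked before every instruction, a looping
service is stopped without any cooperation from the service itself.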

Platform Flexibility 

There is a strong requirement for the HP OpenCall service
execution platform to be flexible. Obviously, it should make
as few assumptions as possible about the applications that
are installed and executing on it, and it should be able to
integrate into any central office environment, both in terms 
of its connection with the SS7 network and its links with 
management applications. 

Multiple Message Sets. The platform by default supports both 
a TCAP/SS7 connection and a TCP/IP connection. It makes 
no assumption about the protocols that are used above
these well-standardized layers. In practice, any message set
defined in ASN.1 (Abstract Syntax Notation One) can be
loaded into the platform, and multiple message sets can be 
supported. The platform can encode and decode messages 
belonging to one of the installed message sets. 

The message set customization tools take as input an annotated
ASN.1 definition of a message set. The output is a
message set information base, which contains a concise
definition of the message set. Multiple message set definitions
can be stored in a single message set information base.
The HP OpenCall service execution platform's service execution
environment loads the message set definitions from
the message set information base at startup time. Services
running in the execution environment can request the platform's
encode/decode engine to process any incoming or
outgoing message with respect to the installed message sets.

The same message set information base can be loaded into 
the HP OpenCall service creation environment. The message
set definitions are available to service developers as part of
the help facility. The message set information base is also 
available in the validation environment, allowing the user to 
submit message-set-specific message sequences to test the 
logic flow of the developed services. The traffic simulation 
tool uses the message set information base to encode and 
decode the user-supplied message sequences. 

Flexible Service Instantiation. When a new TCAP transaction
request is received by the platform, service instances must
be created and executed to process the message. To offer a
high degree of flexibility, the policy for choosing the correct
service instances is programmable, that is, a user-supplied
piece of service logic is executed to choose the most appropriate
service. The decision can be based on any criterion,
such as priority of the request, customer-specific data held
in the database, overload status of the platform, and so on.

To allow tracking of resources, only one service instance
controls the TCAP transaction, although it can be passed
between instances. In this way, if the instance that currently
controls a transaction exits unexpectedly, the platform can
close the associated transactions and free the associated
resources.
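
A minimal sketch of these two ideas follows: a user-supplied selection
policy, and single ownership of each transaction so that resources can
be freed when the owner exits. The Platform class, its method names,
and the request fields are assumptions made for illustration; on the
real platform the policy is itself a piece of service logic.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class Platform:
    # User-supplied policy: maps an incoming request to a service name.
    choose_service: Callable[[dict], str]
    # Transaction id -> id of the single instance that controls it.
    owner: Dict[int, int] = field(default_factory=dict)
    closed: Set[int] = field(default_factory=set)
    _next_instance: int = 0

    def on_new_transaction(self, tid: int, request: dict) -> str:
        service = self.choose_service(request)
        self._next_instance += 1
        self.owner[tid] = self._next_instance  # exactly one controller
        return service

    def pass_control(self, tid: int, to_instance: int) -> None:
        self.owner[tid] = to_instance

    def on_instance_exit(self, instance: int) -> None:
        """Close every transaction still controlled by the exiting instance."""
        for tid in [t for t, i in self.owner.items() if i == instance]:
            del self.owner[tid]
            self.closed.add(tid)


def policy(request: dict) -> str:
    """Hypothetical policy based on overload status and request priority."""
    if request.get("overloaded"):
        return "reject"
    return "premium" if request.get("priority", 0) > 5 else "standard"


platform = Platform(choose_service=policy)
print(platform.on_new_transaction(101, {"priority": 9}))
platform.pass_control(101, to_instance=7)
platform.on_instance_exit(7)   # instance 7 exits while holding the transaction
print(platform.closed)
```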

Flexible Service Structure. The intelligent network market has
traditionally taken a service independent building block-based
approach to service implementation. That is, a set
of service independent building blocks are provided by the
underlying execution platform, allowing applications to
merely link these building blocks together to provide services
to the SS7 network. The disadvantage of this approach
is that the only functionality available to programmers is the
set of available service independent building blocks. These
are often message-set-specific and difficult to customize.

The HP OpenCall service execution platform does not provide
a default set of service independent building blocks.
Instead, it provides the means to structure applications as a
set of components. Thus, if required, a user can implement
the ITU-defined set of standard service independent building
blocks and then use those components to implement applications
to provide higher-level services. Of course, the user
can also decide to implement an entirely different set of
service independent building blocks.

Furthermore, the platform does not distinguish between a
service independent building block (or component) and an
application. Instead, it views both as services. Both can be
arbitrarily complex pieces of logic, defined and installed by
the user. How they interact and what functionality they provide
are entirely under the user's control. To provide a single
service to the SS7 network might only involve a single instance
of service logic, or it might involve the creation and
interworking of multiple such instances.

Platform Management. The platform exports information on
its state via a Management Information Base (MIB). The
MIB can be used to both monitor and control the platform.
For example, installing a new service or a new version of an
existing service onto the platform is performed via the MIB.
Similarly, adding a new TCP/IP connection or supporting a
new subsystem number is also achieved via the MIB. Both
cases involve creating a new object in the MIB. Statistics on
CPU use, TCAP traffic, service execution, database memory
use, and so on are all held in the MIB and can be retrieved by
external applications by issuing a request to the appropriate
objects.

The information is presented as a hierarchy of objects, with
each object representing a part of the platform's functionality.
The hierarchical structure provides an intuitive naming
scheme, and this also allows easy integration into standard
CMIS (Common Management Information Service) or SNMP
(Simple Network Management Protocol) management
frameworks. Users can create new objects, delete existing
objects (obviously changing the functionality of the platform
in the process), update existing objects, or just retrieve
information from individual objects. As mentioned previously,
APIs that can be used remotely are provided, along with a
set of management tools that provide a first level of platform
management.
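
The four operations on the object hierarchy (create, delete, update,
retrieve) can be pictured with a toy object store. The Mib class, the
slash-separated naming scheme, and the object names are all invented
for this sketch and do not describe the actual MIB API.

```python
class Mib:
    """Toy object store addressed by slash-separated hierarchical names."""

    def __init__(self):
        self._objects = {}

    def create(self, name: str, **attrs) -> None:
        parent = name.rpartition("/")[0]
        if parent and parent not in self._objects:
            raise KeyError(f"parent object {parent!r} does not exist")
        if name in self._objects:
            raise KeyError(f"object {name!r} already exists")
        self._objects[name] = dict(attrs)

    def get(self, name: str) -> dict:
        return dict(self._objects[name])  # copy, so callers cannot mutate state

    def update(self, name: str, **attrs) -> None:
        self._objects[name].update(attrs)

    def delete(self, name: str) -> None:
        # Deleting an object also removes everything beneath it.
        doomed = [k for k in self._objects if k == name or k.startswith(name + "/")]
        for key in doomed:
            del self._objects[key]

    def children(self, name: str) -> list:
        prefix = name + "/"
        return sorted(k for k in self._objects
                      if k.startswith(prefix) and "/" not in k[len(prefix):])


mib = Mib()
mib.create("services")
mib.create("services/freephone", version=2, enabled=True)  # installs a service
mib.create("tcap")
mib.create("tcap/stats", messages=0)
mib.update("tcap/stats", messages=120)
print(mib.children("services"), mib.get("tcap/stats"))
```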

Customized Overload Policy. One of the most important responsibilities
of any SS7 network element is to respond in a timely
manner to incoming requests. The response time on requests
must appear to be bounded, and 99% of replies must be generated
within a fixed time interval. This implies that the network
element must react to overload situations.

With the HP OpenCall service execution platform, the overload
policy is programmable. That is, user-defined logic
specifies how the network element reacts when the load is
high. The platform itself collects statistics on a number of
overload indicators such as CPU use, transaction rate, number
of unprocessed messages, and average queuing time for
such messages. These values are available in the MIB and
can be viewed both by the overload service (the logic implementing
the overload policy) and by external management
applications.

The programmability of the overload policy offers a high
level of flexibility. For example, under heavy load, the overload
service may decide to:

• Reject new incoming requests but continue to accept
messages relating to ongoing transactions.

• Request that other network elements reduce the number of
requests. Obviously such a policy is network-specific and
message-set-specific, requiring knowledge of both the SS7
network configuration and the message sets supported by
remote switches.

• Reject new requests that require complex processing
(obviously application-specific), or reject requests for
low-revenue-generating services (again, application-specific).

The platform also provides a set of hooks to control the traffic
flow. The overload policy can, for example, request the
platform to reject new transaction requests (i.e., only accept
messages relating to ongoing transactions), to limit the number
of instances of a given service, or to reject traffic from a
particular remote network element.
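
One possible shape for such a user-defined overload policy is sketched
here. The thresholds, the LoadStats fields, and the message attributes
are arbitrary assumptions chosen for the example; a real policy would
be written as service logic and would read its indicators from the MIB.

```python
from dataclasses import dataclass


@dataclass
class LoadStats:
    cpu_use: float        # fraction of CPU in use, 0.0 to 1.0
    queued_messages: int  # number of unprocessed messages
    avg_queue_ms: float   # average queuing time for those messages


def overload_policy(stats: LoadStats, message: dict) -> str:
    """Return "accept" or "reject" for one incoming message."""
    if not message["begins_transaction"]:
        return "accept"  # always serve ongoing transactions
    heavy = stats.cpu_use > 0.90 or stats.avg_queue_ms > 200
    moderate = stats.cpu_use > 0.75 or stats.queued_messages > 500
    if heavy:
        return "reject"  # shed all new requests
    if moderate and message.get("low_revenue", False):
        return "reject"  # shed low-revenue services first
    return "accept"


busy = LoadStats(cpu_use=0.95, queued_messages=800, avg_queue_ms=250.0)
warm = LoadStats(cpu_use=0.80, queued_messages=100, avg_queue_ms=20.0)
print(overload_policy(busy, {"begins_transaction": True}))
print(overload_policy(busy, {"begins_transaction": False}))
print(overload_policy(warm, {"begins_transaction": True, "low_revenue": True}))
```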

High Availability

All critical components of the HP OpenCall service execution
platform are replicated (see Fig. 5). The core set of software
processes operate in active/standby mode. This forms the
basis both of the platform fault tolerance and of its online
upgradability policy. It uses the HP OpenCall SS7 high availability
platform, with every critical process being a client of
the fault tolerance controller (see article, page 65). For simplicity,
not all of the processes are shown, and not all of the
interprocess links are shown.

Besides replication of processes, the platform also uses
mirrored disks, duplicate signaling interface units, and dual
LAN connections.

The principles that form the basis for the high availability
policy are discussed in the following paragraphs.

Active/Standby Model. An instance of the HP OpenCall service
execution platform runs on each of two independent
machines. One instance, the active, is responsible for
responding to all incoming requests, whether from the
SS7 network or from management applications. The other
instance, the standby, maintains a copy of the active's state,
always ready to take over processing of such requests. The
decision to adopt an active/standby model brings the following
benefits:

• The code is simpler, resulting in fewer run time errors.

• It is easy to provide a single-system view (that is, the active
instance defines the state of the platform).

• It isolates errors because the two hosts are not performing
the same tasks or the same types of tasks.

• The standby host is available for online upgrades, configuration
changes, and so on.

The alternative, to adopt a load-sharing model (with requests
being processed in parallel on two or more hosts), would
have required more complex communication protocols between
the instances (to ensure synchronization and a
single-system view) as well as a greater possibility of an
error simultaneously impacting more than one host.

High Service Availability. The solution guarantees high service
availability to the SS7 network in the event of any single hardware
or software failure. To achieve this, the standby host
must always be ready to take over in the event of a failure
on the active host. For this reason, the standby maintains a
copy of the in-memory database and of the service execution
environment. All relevant changes to the state of the active
(e.g., changes to the active database) are immediately propagated
to the standby. In the event of a switchover, the state
of the standby instance defines the current state. That is, the
standby does not need to retrieve information from either
the failed instance or from other external applications to
become active. Thus, the transition to the active state is
instantaneous, guaranteeing high service availability.

Centralized Fault Recovery Decisions. All switchover and process
restart decisions are made by a centralized process,
the fault tolerance controller, on each node. These two processes
continuously exchange information on the state of the
two hosts. All other processes obey the centralized decision-maker.
This greatly simplifies failure recovery and error
handling. In the event that both LANs go down, the signaling
interface units are used as a tiebreaker. The node that controls
the signaling interface units remains active (see article,
page 65).

Fig. 5. Replication of critical components in the HP OpenCall
service execution platform.
SLEE = Service Logic Execution Environment
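
The tiebreaker rule can be written down as a small decision function.
This is only one reading of the rule stated above: the Role type, the
resolve_role name, and the choice to keep the current role while a LAN
is still up are assumptions of the sketch, not a description of the
fault tolerance controller's real logic.

```python
from enum import Enum


class Role(Enum):
    ACTIVE = "active"
    STANDBY = "standby"


def resolve_role(current: Role, lan_links_up: int, controls_siu: bool) -> Role:
    """Role a node's fault tolerance controller would settle on.

    With at least one LAN up, the two controllers can still exchange
    state, so this sketch simply keeps the current role. With both LANs
    down, control of the signaling interface units breaks the tie.
    """
    if lan_links_up > 0:
        return current
    return Role.ACTIVE if controls_siu else Role.STANDBY


# Both LANs down: only the node holding the signaling interface units is active.
print(resolve_role(Role.ACTIVE, lan_links_up=0, controls_siu=True))
print(resolve_role(Role.STANDBY, lan_links_up=0, controls_siu=False))
```

The point of the tiebreaker is that the two nodes reach complementary
answers without being able to talk to each other, so the site never
ends up with two active hosts.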

Online Upgradability and Maintenance. Because all of the criti- 
cal components are replicated, the existence of the standby 
host is used as a basis for all online upgradability operations, 
such as changing the operating system, installing new ver- 
sions of the platform, reconfiguring services or the service 
execution environment, performing rollback operations on 
the database, and so on. Because most upgrade operations 
can effectively be performed online in this way, the platform 
meets its downtime-per-year requirements. 

Replication Does not Impact Services. Service programmers do 
not need to be aware of the replication of the platform. Fur- 
thermore, propagation of data to the standby is performed as 
a background task, implying a minimal impact on response 
time. Algorithms for resynchronizing the standby after a 
failure are also run in the background. 

Respawnability. It is important that a failed host or a failed
process be restarted quickly. If the standby host is not available,
this introduces a window of vulnerability during which
a second failure could cause the whole network element to
be unavailable, directly impacting service availability. This
respawnability feature is possible because of the continuous
availability of an active system. The failed instance will
restart, rebuilding itself as a copy of its peer. Because the
active instance is assumed to be consistent, the respawned
instance is also guaranteed to be consistent. However, to
avoid rebuilding an instance to a nonconsistent state,
instances are not respawned if the peer is not active (which
might happen in the rare case of a double failure). In such
cases, manual intervention is required to restart the complete
platform. This respawn policy ensures rapid failure
recovery in the case of a single failure but prevents erroneous
failure recovery in the case of a double failure.
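
The respawn rule reduces to a single guard, sketched below with
instance state modeled as a dictionary. The function and exception
names are hypothetical.

```python
import copy
from typing import Optional


class ManualInterventionRequired(Exception):
    """No consistent peer is available to copy from (double failure)."""


def respawn(peer_state: Optional[dict], peer_is_active: bool) -> dict:
    """Rebuild a failed instance as a copy of its peer, or refuse."""
    if not peer_is_active or peer_state is None:
        raise ManualInterventionRequired("restart the complete platform")
    return copy.deepcopy(peer_state)  # consistent because the active peer is


active_state = {"database": {"subscriber_1": "5550100"}, "services": ["freephone"]}
rebuilt = respawn(active_state, peer_is_active=True)
print(rebuilt == active_state, rebuilt is not active_state)
```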

Support for Complete Restart. Although the platform is designed
to tolerate any single failure, in certain cases it may
be necessary to stop and restart both hosts (either because
of a double failure or for operational reasons). Disk-based
copies of both the MIB and the in-memory database are
maintained, and these can be used to rebuild both hosts.
However, some data loss may occur. This is considered to 
be acceptable, since double failures are rare. 

Database Replication 

At the core of the service execution environment is an in- 
memory replicated database. The database is in memory 
to guarantee rapid access and update times. To ensure high 
availability, copies of the in-memory database are kept on 
both the active host and the standby host. Critical updates
to the active database are propagated to the standby. 

The structure and contents of this database are under the 
user's control. When defining the database structure, the 
user must also define the replication policy for every field.
Of course, propagating updates to the standby will impact
performance, but because the user is allowed to specify the
replication policy, the user can also control the impact.

Traditionally, database systems achieve persistency by main- 
taining a copy of the database on disk or on mirrored disks. 
In the HP OpenCall service execution platform, the primary 
standby copy is held in memory on a standby host. This 
offers a number of advantages over the traditional approach:

• Writing to a remote in-memory copy is quicker than logging
to disk, and therefore has a smaller impact on SS7 traffic.

• The degree of availability is higher. In the event of a failure
on the active host, the standby copy is immediately available.
It is not necessary to recreate a copy of the database.

• The standby copy can be taken offline and used to restructure
the database or to roll back to a previous version of the
database.

Periodically, the active host generates a disk-based copy of 
the database. This checkpoint of the database serves a 
number of purposes: 

• The checkpoint ensures that the platform can recover from
double failures.

• The checkpoint is used to reinitialize a standby host if and
when it is restarted after a failure or shutdown.

• The checkpoint can be used for auditing purposes by
external database management systems.

Three algorithms are critical to the database replication 
scheme: the data replication algorithm, the resynchronization
algorithm, and the rollback/restore algorithm.

Data Replication Algorithm. As mentioned above, the user spec- 
ifies the replication policy for every field in the database. 
For certain data in the database, it may not be necessary to 
replicate changes to the standby host. Typical examples are 
counters or flags used by services. 

Consider a set of fields that collect statistics on SS7 traffic.
These would typically be incremented every time that a new
request is received, and would be reset to default values
periodically. For many services, it may be acceptable not to
propagate these traffic statistics to the standby database. This
implies that some data will be lost in the event of a failure
on the active host (all traffic statistics since the last reset), 
but it also implies that less CPU time is consumed replicating 
this data to the standby. This trade-off is application-specific, 
so the decision to replicate is left to the user. 

For fields that are marked for replication, the active database 
instance will generate an update record called an external 
update notification, containing both the old value and the 
new value, for every update. This record is then propagated 
to the standby database instance. By comparing the old value 
with the new value, the standby database can verify that it is 
consistent with the active. In the case of an inconsistency, 
the standby database shuts itself down. It can then be 
restarted, rebuilding itself as a copy of the active database. 
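
The per-field replication policy and the consistency check can be
sketched as follows. The sketch assumes that the standby checks the
old value carried in the record against the value it currently holds;
the class and field names are invented for illustration, and the
network transfer, disk logging, and acknowledgment steps are omitted.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class UpdateNotification:
    field_name: str
    old_value: Any
    new_value: Any


class ActiveDatabase:
    def __init__(self, replicated_fields):
        self.data: Dict[str, Any] = {}
        self.replicated = set(replicated_fields)  # user-defined replication policy

    def update(self, field_name: str, value: Any) -> Optional[UpdateNotification]:
        old = self.data.get(field_name)
        self.data[field_name] = value
        if field_name not in self.replicated:
            return None  # e.g. traffic counters: not propagated
        return UpdateNotification(field_name, old, value)


class StandbyDatabase:
    def __init__(self):
        self.data: Dict[str, Any] = {}
        self.running = True

    def apply(self, note: UpdateNotification) -> bool:
        if self.data.get(note.field_name) != note.old_value:
            self.running = False  # inconsistent: shut down, to be rebuilt later
            return False
        self.data[note.field_name] = note.new_value
        return True


active = ActiveDatabase(replicated_fields={"forward_to"})
standby = StandbyDatabase()
note = active.update("forward_to", "5550100")
print(standby.apply(note), standby.data)
print(active.update("hits", 1))  # unreplicated field produces no record
```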

The flow of an external update notification is shown in Fig. 6.

To handle double failures, the external update notification 
is also written to disk on both the active and standby hosts. 
This is performed as a background task on the active host to 
minimize the impact on the transaction processing activity. 
The active host does not attempt to maintain a replica copy 
of the database on disk. Instead, a log is maintained of all 
of the updates that have been performed. To rebuild the in- 
memory database, the log must be replayed. 

To manage the disk space required for these external update 
notification logs, the active host takes regular checkpoints. 
This task is treated as a low-priority background activity, 
and the frequency of the checkpoints can be controlled by 
the user (obviously as a function of the available disk space). 
When a checkpoint operation has completed, redundant 
external update notification log files can be removed. By 
default, the active host keeps the two most recent check- 
point files and intermediate external update notification log 
files on disk (following the principle of replicating all critical 
components). 
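
The interplay of checkpoints and log files can be sketched as follows. This is a simplified model under stated assumptions, not the platform's implementation: the DiskStore class and its methods are hypothetical, a checkpoint is treated as an instantaneous snapshot, and each log entry holds only a key and its new value.

```python
from typing import Dict, List, Tuple

Update = Tuple[str, int]  # (key, new value)


def rebuild(checkpoint: Dict[str, int], logs: List[List[Update]]) -> Dict[str, int]:
    """Rebuild the in-memory database by replaying logs over a checkpoint."""
    db = dict(checkpoint)
    for log in logs:
        for key, value in log:
            db[key] = value
    return db


class DiskStore:
    """Hypothetical stand-in for the checkpoint and log files on disk."""

    def __init__(self, keep_checkpoints: int = 2) -> None:
        self.keep = keep_checkpoints
        self.items: list = []  # chronological ("ckpt", snapshot) / ("log", updates)

    def add_log(self, updates: List[Update]) -> None:
        self.items.append(("log", list(updates)))

    def add_checkpoint(self, snapshot: Dict[str, int]) -> None:
        self.items.append(("ckpt", dict(snapshot)))
        positions = [i for i, (kind, _) in enumerate(self.items) if kind == "ckpt"]
        if len(positions) > self.keep:
            # Drop everything older than the oldest checkpoint still kept.
            self.items = self.items[positions[-self.keep]:]

    def recover(self) -> Dict[str, int]:
        last = max(i for i, (kind, _) in enumerate(self.items) if kind == "ckpt")
        logs = [payload for kind, payload in self.items[last + 1:] if kind == "log"]
        return rebuild(self.items[last][1], logs)
```

Keeping two checkpoints plus the intermediate logs means that recovery remains possible even if the most recent checkpoint file is unusable, at the price of extra disk space.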

Because database updates must be allowed during the check-
pointing operation, each checkpoint file has an associated
sequence of external update notifications that were generated
while the checkpoint was being taken. To rebuild a copy of
the in-memory database, both the checkpoint file and the
associated external update notifications are required. Typi-
cally, the contents of the disk containing these database-
related files will be as depicted in Fig. 7. The external up-
date notifications are stored in a sequence of files. The rate
at which one file is closed and a new file in the sequence is
opened can be controlled by the user (either as a function
of time or file size, or on demand). Thus, multiple external
update notification log files can be created between two
checkpoints, or indeed during a checkpoint operation. A
copy of the in-memory database can be rebuilt either from a
checkpoint file and associated external update notification
log files (those files generated during the checkpoint) or
from a checkpoint file, associated external update notifica-
tion log files, and subsequent external update notification
log files. To hold a copy of the database on a remote storage
device, the user should close the current external update
notification log file (the active database will close the cur-
rent file, assign it a file name in the sequence, and open a
new current external update notification log file), and the
user should then copy to stable storage the most recent
checkpoint file and the external update notification log files
generated during and since that operation.

Fig. 6. External update notification flow. (1) Update performed
on active database. (2) Update record (external update notifica-
tion) sent to standby. (3) External update notification logged to
disk on active host. (4) Update performed on standby database.
(5) External update notification logged to disk on standby host.
(6) Acknowledgment sent to active host. (7) External update
notification forwarded to interested external applications.

Fig. 7. Checkpoint and external update notification files on disk.

Resynchronization Algorithm. Failures will happen. Therefore,
a failure recovery algorithm is required. One critical compo-
nent is the recovery of the standby copy of the in-memory
database. The checkpoint and external update notification
log files also play a critical role in this algorithm. The algo-
rithm is complex because updates are permitted on the
active database while the standby database is being rebuilt.
The resynchronization process can take a long time (in the
order of minutes), so it would not be acceptable to disallow
database updates during that period.

The sequence of steps in resynchronization is as follows:

1. The standby host copies the most recent database check-
point file and external update notification log files from the
active file system to its local file system.

2. The standby host rebuilds the in-memory copy of the data-
base from this consistent set of files.

3. When this operation is complete, the standby database
asks the active database to close the current external
update notification log file. It then retrieves that file and
replays the external update notification records.

4. Step 3 is then repeated. The assumption is that, because
the standby host is not performing any other tasks, with
each iteration of step 3, the size of the current external up-
date notification log file will be reduced. In effect, the state
of the standby database is moving closer to that of the active
database. Eventually the size of the current external update
notification log file will be small enough for the algorithm to
move to the next step.

5. The standby database again asks the active database to
close the current external update notification log file. At this
point, a connection is also established between the two
database copies. Now, while retrieving and processing the
latest external update notification log file, the standby data-
base is also receiving new external update notifications via
the socket connection.

6. Once the latest external update notification log file is
processed, the standby database starts to replay the queued
external update notifications. At this point, with the estab-
lishment of the real-time connection between the two data-
base copies, the two copies are considered to be synchro-
nized, and the standby database is considered to be hot
standby.
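
The convergence behavior of the resynchronization steps can be illustrated with a small single-process simulation. It is a sketch under several assumptions: ActiveDatabase, resynchronize, the small_enough threshold, and the max_rounds guard are hypothetical names, files and the socket connection are replaced by in-memory lists, and the traffic callback stands in for updates arriving on the active host while the standby catches up.

```python
from typing import Callable, Dict, List, Tuple

Update = Tuple[str, int]  # (key, new value); old-value checks omitted for brevity


class ActiveDatabase:
    def __init__(self, data: Dict[str, int]) -> None:
        self.db = dict(data)
        self.current_log: List[Update] = []        # updates since the last log close
        self.live_queues: List[List[Update]] = []  # stand-ins for socket connections

    def update(self, key: str, value: int) -> None:
        self.db[key] = value
        self.current_log.append((key, value))
        for queue in self.live_queues:
            queue.append((key, value))

    def close_log(self) -> List[Update]:
        closed, self.current_log = self.current_log, []
        return closed


def replay(db: Dict[str, int], log: List[Update]) -> None:
    for key, value in log:
        db[key] = value


def resynchronize(active: ActiveDatabase, checkpoint: Dict[str, int],
                  closed_logs: List[List[Update]], traffic: Callable[[], None],
                  small_enough: int = 1, max_rounds: int = 100):
    # Steps 1 and 2: copy the checkpoint and its log files, then rebuild.
    standby = dict(checkpoint)
    for log in closed_logs:
        replay(standby, log)
    # Steps 3 and 4: close and replay the current log until it is short.
    for _ in range(max_rounds):
        traffic()  # the active host keeps accepting updates
        log = active.close_log()
        replay(standby, log)
        if len(log) <= small_enough:
            break
    # Step 5: open the live connection and close the log one last time.
    queue: List[Update] = []
    active.live_queues.append(queue)
    last_log = active.close_log()
    traffic()  # these updates are queued on the live connection
    replay(standby, last_log)
    # Step 6: replay the queued notifications; the copies are now in step.
    replay(standby, queue)
    queue.clear()
    return standby, queue
```

The loop in steps 3 and 4 terminates only if the standby replays faster than the active host generates updates, which is the assumption stated in step 4; the max_rounds guard is an addition of this sketch.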

Rollback/Restore Algorithm. As with all database systems, it is
also possible to roll the in-memory database back to a pre-
vious state. Again, the checkpoint and external update noti-
fication log files play a critical role in this algorithm.

This operation can be achieved "online" by using the standby
copy of the database, which can be taken offline without
impacting the active host. While the latter continues to pro-
cess TCAP traffic, the rollback algorithm can be applied to
the standby database. In effect, it rebuilds itself from a check-
point file and associated external update notification log files.
Once this operation is completed, the user can request a
switchover of the two hosts. The previously offline standby
host will now begin to receive and to process TCAP traffic.
The peer host should then be rebuilt as a copy of this new
active host, using the synchronization algorithm described
above.

Scalability 

The HP OpenCall service execution platform is implemented
as a set of processes running on the HP-UX operating system.
This gives it a degree of scalability, since it can run on a range
of HP machines and can also benefit from multiprocessor
hardware. Low-end configurations are executed on a single-
processor, 48-MHz machine with 128M bytes of RAM. The
high-end configurations currently execute on a dual-proces-
sor, 96-MHz machine with 768M bytes of RAM. The migration
to HP-UX 10.20 increases the capacity of the platform both
in terms of the maximum supported TPS rate and the maxi-
mum database size.

An extra degree of scalability is provided with the support
of mated-pair configurations. Fig. 3 showed the standard
single-site duplex configuration, that is, an active/standby
configuration running at one geographical location with a
single connection to the SS7 network (multiple links are
used for redundancy and load sharing, but a single address
is shared by these links). A distributed solution, with multiple
active/standby configurations running at a set of sites (each
with its own SS7 address) provides a number of benefits:

• Extra TPS capacity, since sites can process traffic in parallel

• Increased database capacity

• Tolerance of site failures or outages. This is critical if and
when it is necessary to shut down a complete site for main-
tenance or operational reasons (or in the unlikely case of a
double failure hitting both the active and standby hosts).

The HP OpenCall service management platform can be used
to manage a distributed configuration, as shown in Fig. 8.
Although this is referred to as a mated-pair configuration, it
is not limited to two sites: multiple sites can be supported.
Furthermore, each site is not required to have the same
database, that is, the contents of the databases at different
sites can be completely different. However, in general, sites
will be paired: hence the name mated-pair. For a given site,
a copy of its database will also be maintained at one other,
geographically remote site. This remote site then has the
ability to take over if the original site is shut down. The role
of the HP OpenCall service management platform is to main-
tain consistency across these multiple sites.

The centralized HP OpenCall service management platform
maintains a disk-based copy of each in-memory database. It
receives notifications (external update notifications) from
each site when the in-memory database is changed. Such
external update notifications are then propagated to all other
sites containing the altered data.

If an operator or system administrator wishes to change the
contents of the in-memory database, the request should be
sent to the service management platform. It will then forward
the request to all sites holding a copy of the altered data.

Sites will typically process TCAP traffic in parallel, so it is
possible that the same data may be changed simultaneously
at two separate sites. Both will generate external update
notifications and both external update notifications will be
propagated to the service management platform. The plat-
form is responsible for detecting this conflict and for ensur-
ing the consistency of the replicas. It uses the old value field
of the external update notification to detect that simulta-
neous updates have occurred. Consistency is ensured by
rejecting any external update notification in which the old
value does not match the current value held by the service
management platform, and by applying the current value at
the site from which the erroneous external update notifica-
tion was received. In effect, this establishes the service man-
agement platform's database as the master database. In the
event of a discrepancy, the service management platform
ensures that its view is enforced and that all copies will
become consistent.
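
The conflict rule lends itself to a short sketch. The function below is a hypothetical model, not the service management platform's code: the master database and the per-site databases are plain dictionaries, and reconcile is an assumed name.

```python
from typing import Any, Dict


def reconcile(master: Dict[str, Any], sites: Dict[str, Dict[str, Any]],
              origin: str, key: str, old_value: Any, new_value: Any) -> bool:
    """Process one external update notification received from site `origin`."""
    current = master.get(key)
    if current != old_value:
        # Simultaneous update detected: reject it and enforce the master's
        # value at the site that sent the stale notification.
        sites[origin][key] = current
        return False
    master[key] = new_value
    # Propagate the accepted update to every site holding the altered data.
    for database in sites.values():
        if key in database:
            database[key] = new_value
    return True
```

If two sites change the same field from 5 to 6 and from 5 to 9 respectively, whichever notification reaches the platform first is accepted, and the other is rejected because its old value no longer matches the master copy.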

The HP OpenCall service management platform also supports
an auditing function. The service management platform
orders a checkpoint of the HP OpenCall service execution
platform's in-memory database, and then compares the con-
tents of the resulting checkpoint and external update notifi-
cation log files with the contents of its own disk-based data-
base. Discrepancies are reported to the system administrator.

Conclusion

The intelligent network architecture allows the telecommu-
nications industry to move to a more open solution, in
which the creation, deployment, and modification of ser-
vices is independent of a particular network equipment sup-
plier and equipment from different providers can interwork
to provide new and innovative services. Having independent
and neutral intelligent network platform providers can en-
force implementation of standards, hence ensuring better
interconnectivity.

The HP OpenCall service execution platform is an open and
flexible platform. It has been installed in a large number of
different networks throughout the world, and has shown its
ability to interwork with equipment from a host of other
network equipment providers. It has been used to implement
mission-critical network elements such as service control
points and service data functions using standard, off-the-shelf
computer technology. This use of standard hardware and a
standard operating system will allow operators to benefit
from the evolution of information technology with few addi-
tional costs and limited engineering risk.

HP-UX 9.* and 10.0 for HP 9000 Series 700 and 800 computers are X/Open Company UNIX 93
branded products.

UNIX is a registered trademark in the United States and other countries, licensed exclusively
through X/Open Company Limited.

X/Open is a registered trademark and the X device is a trademark of X/Open Company Limited
in the UK and other countries.

The HP OpenCall SS7 Platform



The HP OpenCall SS7 platform allows users to build computer-based 
signaling applications connected to the SS7 signaling network. 

by Denis Pierrot and Jean-Pierre Allegre 



Today's telecommunications operators need to offer more
and more services to their customers. Because of deregula-
tion and the resulting competition, network operators have
to be able to bring to the market useful value-added services
to differentiate themselves from the competition. To support
new functionalities, telecommunications networks have un-
dergone an important restructuring starting in the mid-1980s.
This restructuring resulted in the separation of the signaling
functions from the voice transmission functions.

Signaling includes all of the necessary procedures to set up,
tear down, and control calls. Before this split was made, the
networks were using inband signaling: the signaling informa-
tion was conveyed over the same channel as the voice with
some predefined tones (see Fig. 1a). This technique had many
drawbacks, including:

• Long call setup times. Addressing information needed to be
outpulsed one digit at a time for each intermediate switch in
the voice path.

• Security problems. Billing fraud was possible by faking the
inband signaling and billing tones.

• Limitations on the number of new services that could be
provided.

Fig. 1. (a) Before signaling was separated from voice transmission,
networks used inband signaling: the signaling information was
conveyed over the same channel as the voice. (b) After the split, all
of the connection setup, teardown, and control is effected via the
signaling network and the voice trunks are dedicated to transporting
voice only.



The Signaling Network 

With the separation of signaling and voice transmission,
the concept of the signaling network was introduced. The
signaling network is a digital, robust, packet network with
built-in redundancy to achieve a high degree of availability.
Fig. 1b shows the typical topology. All of the connection
setup, teardown, and control is effected via the signaling
network and the voice trunks are dedicated to transporting
voice only.

The creation of the signaling network, often called common 
channel signaling or CCS, makes it possible to implement 
an important set of new services because of the global con- 
trol it provides over the transmission network. The current 
implementation of the signaling network is called Signaling 
System #7, or SS7. 

The signaling network is the foundation for the intelligent
network, which makes it possible to deliver new services to
network operators' customers in a timely and cost-effective
manner. The intelligent network is programmable so that new
services can be easily provisioned. It uses vendor-indepen-
dent interfaces so that multivendor networks can be built. It
allows rapid introduction of new services, and it distributes
the intelligence in the network into a few intelligent elements.
For more information on intelligent networks, refer to the
article on page 46.

Elements of the Signaling Network 

The signaling network is a packet network built using the 
following elements: 

• The service switching point is a switch that is able to
interact with the signaling network.

• The signaling transfer point is a packet switch that routes
messages between end points of the SS7 network. Signaling
transfer points are often compared to IP routers. Signaling
transfer points have no connections to voice trunks or
telephone lines.

• The service control point is the place of execution of value-
added services. Historically, service control points were
seen as databases only. The Advanced Intelligent Network
architecture describes them more as the place of execution
of the service logic.

• Signaling links are the physical connection between
elements of the SS7 network. They provide a full-duplex
64-kbit/s digital path conforming to the V.35, DS0A, or T1/E1
standards. A group of signaling links connecting the same
two elements can be grouped logically into a linkset. The
SS7 protocol provides procedures for redundancy and load
sharing between links of the same linkset. For example, if
a link within a linkset fails, the protocol will automatically
move the traffic to the nonaffected link and will try to
restore the failed link.

58 August 1997 Hewlett-Packard Journal

© Copr. 1949-1998 Hewlett-Packard Co.

SSP = Service Switching Point
STP = Signaling Transfer Point
SCP = Service Control Point

Fig. 2. Elements of the signaling network.

Fig. 2 shows typical elements of an SS7 network. The signal- 
ing transfer points are provisioned in pairs. Service control 
points and service switching points always connect to two 
different nodes. Similarly, links are redundant such that
messages can always find an alternate route in case of a 
failure. 

Use of the Signaling Network 

The signaling network is used for many purposes: It is used 
for regular calls, allowing rapid setup and secure operation. 
It is used in the wireline fixed network to provide additional 
services requiring specific service logic and databases (800 
numbers, alternate billing, etc.). It is used in the mobile net- 
work to manage mobility information. For example, when 
a mobile phone is switched on, the home location register 
containing the subscriber profile is queried using the signal-
ing network.

The signaling network is now the basic infrastructure for
the global telecommunications network. SS7 networks are
deployed in almost all countries now, with variable cover-
age. North American networks are defined by ANSI and
Bellcore, while the rest of the world usually follows the
ITU standard. The two flavors of the standard are similar,
but (of course!) incompatible. The ITU version is used at
the boundary of the international networks.

The SS7 Protocol Stack

The SS7 reference model is based on the Open Systems
Interconnection (OSI) reference model of the International
Organization for Standardization (ISO), following similar
principles with layers of protocols. However, the SS7 model
is more specialized, being designed for signaling information
transfer with a specific focus on low latency and built-in
robustness.

Fig. 3 shows the SS7 protocol stack. MTP stands for Message
Transfer Part. It represents the three lower layers of the pro-
tocol stack. The Signaling Connection Control Part (SCCP)
is built on top of MTP Level 3. The ISDN User Part (ISUP)
sits on top of MTP (and potentially SCCP also). The Trans-
action Capabilities Application Part (TCAP) resides on top
of SCCP. Let's look at each layer's functions.



MTP Level 1. The physical layer of the SS7 protocol is based
on digital transmission channels known as signaling links,
connecting two digital elements with a rate of 56 kbits/s
(ANSI) or 64 kbits/s (ITU). The physical network can be
composed of V.35, DS0A, or T1/E1 links.

MTP Level 2. MTP Level 2 maps onto layer 2 of the OSI seven-
layer model and provides a basic message exchange with an
error correction mechanism based on the retransmission of
unacknowledged messages. An alignment procedure ensures,
if successful, that links are able to convey messages between
two points.

Unlike most level 2 protocols, MTP 2 has some unusual fea-
tures. For example, it keeps filling the available bandwidth
by sending small messages (fill-in signaling units or FISU),
especially when there is no user traffic. This allows it to
promptly detect any physical link failure and to react
accordingly. From an implementation point of view, this is a
very stressful feature and usually requires specific hardware
and firmware. Another interesting feature of MTP 2 is its
ability to return unacknowledged frames stored in its buffers
to the upper layer (MTP 3) in case of failure. This allows
MTP 3 to retrieve the frames from a failed link and send
them again on another link without any data loss.
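
The retrieval feature can be pictured with a minimal sketch. The Link class and change_over function are hypothetical illustrations; the real procedure also involves coordination with the remote node, which is omitted here.

```python
from collections import deque
from typing import Deque, List


class Link:
    """Toy model of a signaling link with a retransmission buffer."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.in_service = True
        self.unacked: Deque[str] = deque()  # frames sent but not yet acknowledged

    def send(self, frame: str) -> None:
        self.unacked.append(frame)

    def acknowledge(self, count: int) -> None:
        # The far end confirmed the oldest `count` frames.
        for _ in range(min(count, len(self.unacked))):
            self.unacked.popleft()

    def retrieve(self) -> List[str]:
        """Hand the unacknowledged frames back to the upper layer."""
        frames = list(self.unacked)
        self.unacked.clear()
        return frames


def change_over(failed: Link, alternate: Link) -> List[str]:
    # The upper layer takes the failed link out of service and resends
    # whatever was still unacknowledged, in order, on another link.
    failed.in_service = False
    frames = failed.retrieve()
    for frame in frames:
        alternate.send(frame)
    return frames
```

Because only unacknowledged frames are resent, and in their original order, nothing that the remote end has already confirmed is duplicated in this model.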

MTP Level 3. MTP Level 3 handles the routing functions and
network management procedures of the SS7 network. MTP 3
is the key contributor to the built-in robustness of the SS7
network. The network management functions are the most
complex features of the SS7 protocol. They are in charge
of maintaining the integrity of the signaling network. These
functions can be split into three areas: link management,
traffic management, and route management.

Link management is responsible for the integrity of one link.
It uses services (especially counters) provided by MTP 2 to
monitor the quality of a link. If the link is considered to be
in error (excessive error rate, for example), then the link
is removed from service, messages are rerouted to alternate
links, and the adjacent node is notified to do the same. MTP 3
then starts an alignment procedure to try to restart the link
in a clean state.

Fig. 3. The SS7 protocol stack.

Fig. 4. The SS7 network as seen by the Message Transfer Part
Level 3 (MTP 3) on the local node.

Traffic management handles traffic on the links within a
linkset. It is in charge of load-sharing the traffic over all
the active links and of rerouting traffic from a failed link to
another.
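
Load sharing and rerouting within a linkset can be sketched as follows. The Linkset class is hypothetical, and the integer selector is an assumption of this sketch: it stands for a per-message value used to spread traffic, comparable in role to the signaling link selection field of SS7 messages.

```python
from typing import Dict, List


class Linkset:
    """Toy model of load sharing over the active links of a linkset."""

    def __init__(self, link_names: List[str]) -> None:
        self.available: Dict[str, bool] = {name: True for name in link_names}

    def fail(self, name: str) -> None:
        self.available[name] = False

    def restore(self, name: str) -> None:
        self.available[name] = True

    def active_links(self) -> List[str]:
        return [name for name, up in self.available.items() if up]

    def select_link(self, selector: int) -> str:
        # Spread messages over whichever links are currently in service.
        links = self.active_links()
        if not links:
            raise RuntimeError("no link available in linkset")
        return links[selector % len(links)]
```

When a link fails, the same selector values simply map onto the remaining links, which is the rerouting behavior described above in its simplest form.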

Route management is in charge of maintaining information 
on the network topology and the availability or unavailability 
of certain paths to reach destination nodes. The interesting 
feature of MTP 3 is that it only knows adjacent nodes and 
destination nodes and not other intermediate nodes (Fig. 4).
It maintains a table of available routes to reach a destination 
node via an adjacent node. 

The other functions of MTP 3 are message routing, discrimi- 
nation, and distribution. On the outbound path, this means 
finding the right link to reach the target destination node. 
On the inbound path, it has to determine for which higher-
level protocol the packet is intended (ISUP, SCCP, etc.). It
can reroute the packet via the network if it determines that 
it is not the intended target. 

SCCP. SCCP is built on top of the MTP 3 layer and provides
additional end-to-end services such as connectionless or
connection-oriented service, extended addressing, and net-
work management functions.

SCCP users are assigned a specific address called a subsys-
tem number which, along with the MTP 3 address (called a
point code), makes it possible to uniquely address an SCCP
user in the SS7 network. The extended addressing feature of
SCCP allows the use of a label or global title in place of the
subsystem number and point code to address an application.
This allows for symbolic addressing and provides a level of
indirection with respect to the physical structure of the SS7
network. The translation of the global title to the subsystem
number and point code is accomplished either in the signal-
ing transfer points or in the endpoints. Very often, the dialed
digits are used as the global title.
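
A global title translation can be pictured as a table lookup. The sketch below is illustrative only: the SccpAddress type, the table contents, and the longest-prefix matching rule are assumptions of this example, not a description of how any particular signaling transfer point is configured.

```python
from typing import Dict, NamedTuple, Optional


class SccpAddress(NamedTuple):
    point_code: int         # MTP 3 address of the node
    subsystem_number: int   # SCCP user on that node


def translate_global_title(table: Dict[str, SccpAddress],
                           digits: str) -> Optional[SccpAddress]:
    """Map dialed digits to an address using the longest matching prefix."""
    for length in range(len(digits), 0, -1):
        address = table.get(digits[:length])
        if address is not None:
            return address
    return None
```

Because callers address the service by digits rather than by point code, the operator can move the service to another node by changing the table alone.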

In connectionless mode, SCCP operates a bit like UDP (User
Datagram Protocol) operates in the Internet world: messages
are sent to a target address (either a global title or a subsys-
tem number and point code) and are transmitted from node
to node by MTP 3 to the final destination. There is no guar-
antee of delivery of the message, nor is it guaranteed that
the messages will arrive in the order in which they
were sent. The connectionless mode is the most widely
used, especially because TCAP uses it.

In connection-oriented mode, SCCP operates a bit like X.25. 
A virtual circuit must first be opened before data transfer 
can take place. Once the circuit is open, there is a guarantee 
that messages are delivered and in the right order. 

SCCP also has built-in network management functions. Each 
node in the network maintains the state of its SCCP users 
identified by their subsystem number. The SCCP layer is in 
charge of broadcasting the state of its own subsystem num- 
bers to the other nodes, so that at any time, an SCCP user 
knows about the state of the remote subsystems. 

TCAP. The objective of the Transaction Capabilities Applica-
tion Part is to provide the means of transferring noncircuit-
related information (unlike ISUP, which handles circuit-
related information) between different nodes of the SS7
network. TCAP is especially used to access service control 
points in the fixed network or to access home location regis- 
ters, a short message service center (SMSC), or an equipment 
identification center (EIC) in the mobile network. 

The TCAP layer is divided into two sublayers. The first, the
transaction sublayer, deals with the exchange of TCAP
messages. A transaction, called a dialogue at the user level,
can be unstructured (composed of one unidirectional TCAP
message) when no explicit initiation or termination is needed.
For more interaction, a structured dialogue is used with a
beginning, an exchange, and a termination or an abortion.
This sublayer uses the SCCP connectionless service.

The upper sublayer is the component sublayer and is dedi-
cated to operations. An operation is an action (with parame-
ters) to be performed by the remote end. Each operation is
encoded into components, which are part of a TCAP message
payload. Components convey either an operation request or
an operation response. Simultaneous operations are allowed
inside a transaction and TCAP is able to support multiple
simultaneous transactions with different remote TCAPs. The
addressing for each TCAP user is the addressing provided
by SCCP (point code and subsystem number or global title).
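
The relationship between the two sublayers can be sketched with a few data structures. Everything here is illustrative: Component, TcapMessage, and Dialogue are hypothetical names, the message kinds follow the beginning, exchange, termination, and abortion phases described above, and no encoding is attempted.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Sequence


@dataclass
class Component:
    invoke_id: int
    kind: str        # "request" or "response"
    operation: str
    parameters: Dict[str, Any] = field(default_factory=dict)


@dataclass
class TcapMessage:
    kind: str        # "unidirectional", "begin", "continue", "end", or "abort"
    transaction_id: int
    components: List[Component] = field(default_factory=list)


class Dialogue:
    """One structured dialogue carrying several simultaneous operations."""

    def __init__(self, transaction_id: int) -> None:
        self.transaction_id = transaction_id
        self.state = "idle"
        self.pending: Dict[int, str] = {}  # invoke_id -> operation awaiting a response

    def send(self, kind: str, components: Sequence[Component] = ()) -> TcapMessage:
        allowed = {"idle": {"begin"}, "active": {"continue", "end", "abort"}}
        if kind not in allowed.get(self.state, set()):
            raise ValueError(f"cannot send {kind!r} in state {self.state!r}")
        for component in components:
            if component.kind == "request":
                self.pending[component.invoke_id] = component.operation
        self.state = "active" if kind in ("begin", "continue") else "closed"
        return TcapMessage(kind, self.transaction_id, list(components))

    def receive(self, message: TcapMessage) -> None:
        for component in message.components:
            if component.kind == "response":
                self.pending.pop(component.invoke_id, None)
        if message.kind in ("end", "abort"):
            self.state = "closed"
```

The invoke identifier is what lets several operations be outstanding at once inside a single transaction: each response is matched to its request independently of message order.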

Fig. 5 shows a TCAP interaction with the separation of the
transaction layer and the component layer.

Fig. 5. A TCAP (Transaction Capabilities Application Part) trans-
action, showing the transaction layer and the component layer.
(ITU-T terminology is shown; ANSI has equivalent services.)

Fig. 6. A typical ISUP (ISDN User Part) interaction.

ISUP. The ISDN User Part (ISUP) is a circuit-related protocol,
which means that it defines and transports the necessary
messages to set up, tear down, and control voice and data
circuits. It uses the MTP 3 services to transport messages
from switches to switches.

A typical ISUP interaction is shown in Fig. 6. User A takes
the receiver off the hook and dials 555-xxxx. The local
switch (333) looks up its routing table and finds out that
it should route the call to switch 444, which is an access
tandem (not the final destination). It then sends an initial
address message to switch 444 via the SS7 network and
reserves a voice circuit to switch 444. When switch 444
receives the initial address message, it reserves the other
end of the voice circuit, finds out that the call should be
routed to switch 555, and sends another ISUP initial address
message via SS7 to switch 555. Switch 555 accepts the call,
reserves the voice circuit with switch 444, and sends back
an address complete message to switch 444, which forwards
it to switch 333, triggering the ring-back of user A (via the
voice path). Switch 555 also rings the destination phone.
When user B takes the receiver off the hook, switch 555
sends an answer message over the SS7 network to switch
444, which forwards it to switch 333. The call can now
proceed.
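
The message sequence of this example can be generated from the route alone, as the following sketch shows. The function name is hypothetical; IAM, ACM, and ANM are the usual abbreviations for the initial address, address complete, and answer messages. Circuit reservation and ringing are not modeled.

```python
from typing import List, Tuple

Message = Tuple[str, int, int]  # (message type, sending switch, receiving switch)


def setup_call(route: List[int]) -> List[Message]:
    """Return the signaling messages exchanged to set up a call along `route`."""
    forward = list(zip(route, route[1:]))
    backward = [(dst, src) for src, dst in reversed(forward)]
    # The initial address message travels hop by hop toward the destination.
    messages = [("IAM", src, dst) for src, dst in forward]
    # Address complete, then answer, travel back toward the originating switch.
    messages += [("ACM", src, dst) for src, dst in backward]
    messages += [("ANM", src, dst) for src, dst in backward]
    return messages
```

For the route 333, 444, 555 this yields two initial address messages, two address complete messages, and two answer messages, matching the walk-through above.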

The release phase uses the same kind of message interaction.
ISUP also allows many other supplementary services.

The HP OpenCall SS7 Platform

The HP OpenCall SS7 platform allows users to build com-
puter-based signaling applications connected to the signal-
ing network. Using computers to achieve some of the intelli-
gent network functions is one of the key benefits of the
intelligent network architecture. Compared to modifying
switch software, it is less expensive, faster, and easier to
program computers in the intelligent network. The HP
OpenCall SS7 platform provides the hardware and the
middleware necessary to use a computer in a signaling net-
work.

The main characteristics of the HP OpenCall SS7 platform 
are: 

• It provides the protocols to connect to the SS7 network.
This mostly consists of specialized hardware (for MTP Levels
1 and 2) and protocol stacks (MTP, SCCP, TCAP, ISUP) for
various flavors (ANSI, ITU, Chinese, hybrid).

• It provides high availability, since most of the target
applications are mission-critical (see article, page 65).

• It provides the necessary components for the computer to
be integrated in a central office. This means, for example,
support of a -48 Vdc power supply, antiseismic capabilities,
compliance with established standards, and so on.

• It provides open application programming interfaces (APIs) 
for users to write applications. 

The HP OpenCall SS7 platform is a platform in the sense that 
it does not provide the application itself but rather allows 
users to build the application. The platform can be instan- 
tiated under several options that will be described later. 

Core Protocol Implementation 

The network connection is made by a dedicated communica-
tions unit called the signaling interface unit (see Fig. 7).
Each signaling interface unit has a SCSI interface card for
the host connection and three slots for signaling link inter-
face cards. These cards provide two links each, with differ-
ent options for each supported type of link (V.35, DS0A, and
T1/E1). These cards come from the HP 37900 SS7 protocol
analyzers. The signaling interface unit can also be expanded
by means of a dedicated expansion box, which provides four
additional slots (eight more links) for a single SCSI attach-
ment. Signaling interface units can be chained on the SCSI
bus and the platform can currently support up to 64 links.

Each signaling link interface card runs the MTP 1 and MTP 2
protocols and sends and receives messages to and from the
host via the SCSI interface.

On the host side, messages are read by an SS7 driver built
on top of the SCSI driver in the HP-UX* kernel. On top of
these, a single user space process implements MTP 3, SCCP,
and TCAP, sending and receiving messages to and from the
SS7 driver.

Fig. 7. The network connection is made by a dedicated commu-
nications unit called the signaling interface unit. Each signaling
interface unit has a SCSI interface card for the host connection
and three slots for signaling link interface cards.

Fig. 8. For each layer of higher-level protocols, an operation,
administration, and maintenance (OA&M) programmatic access
is provided.

APIs are provided as a library to be linked with the user pro- 
cess. The library is in charge of managing the interaction 
with the user application, implementing interprocess com- 
munication between the application and the SS7 stack, and 
supporting the flow control. 

For each layer of higher-level protocols, an operation, admin- 
istration, and maintenance (OA&M) programmatic access is 
provided (Fig. 8). This allows an application developer to
control the state of the protocol stack or to implement man- 
agement applications (monitoring, configuration, etc.). 

Each layer is directly accessible via a direct API. Some APIs 
are simple wrappers that get the user parameters and mar- 
shall them to the stack. Other APIs, such as the ISUP and 
TCAP APIs, implement some part of the protocol in the 
library itself, allowing wider distribution of the processing 
load. All of the APIs are asynchronous to allow for high 
transaction rates. 

High Availability 

As explained above, one of the key aspects of the platform 
is high availability. The SS7 network has built-in high avail- 
ability capabilities and it is important that the end node also 
provide these capabilities. 

Our solution is based on the active/standby model (see the
article on page 65 for more details). To eliminate any single 
point of failure, every element is replicated (see Fig. 9). Two 
hosts are used, one being the active host and the other the 
standby host. Only the active host processes the traffic, while 
the standby just keeps its state up to date. For network 
attachment, we use dual-ported signaling interface units, 
and each unit is connected to two different SCSI chains 
terminating at each host. The dual-ported signaling interface 
unit has built-in logic such that one and only one SCSI bus
can be active at any point in time. 

Each host has two SCSI interface cards, each connected to 
one half of the signaling interface unit set. Fig. 9 also shows 
how signaling interface units are chained. The active SS7 
stack, running on the active host, uses the two SCSI chains 
terminating at it. The standby stack controls its two SCSI
chains but does not have control of the dual-ported signaling 
interface units. The active stack has control of the signaling 
interface units via its two SCSI interfaces. In case of a failure 
of the active side, the standby side will take over and will 

Fig. 9. High availability of the HP OpenCall SS7 platform is based
on the active/standby model.

take control of all the signaling interface units by using its
two SCSI buses. The switchover happens in less than six
seconds to be transparent from the SS7 network point of
view. Refer to the article on page 65 for more details on the
mechanism.

The SS7 links of a given linkset must be connected to two 
different signaling interface units so that if one signaling 
interface unit happens to fail, the SS7 traffic will be routed 
transparently by the network to the surviving signaling inter- 
face units. Therefore, from a network attachment point of 
view, the architecture is more a load-sharing architecture,
whereas it is an active/standby architecture at the host level. 

From an application point of view, the API hides the fact 
that there are in fact two SS7 stacks running. Application 
developers are free to use their own high availability mecha- 
nism, either load shared or active/standby. 

Distribution 

Another important aspect of the HP OpenCall SS7 platform
is its ability to support distributed applications. The key con- 
cept here is a front-end/back-end mode. The front-end com- 
puter supports the SS7 connection and protocol and the 
back-end computer supports the application. A typical con- 
figuration is shown in Fig. 10. 

The SS7 stack is able to distribute the traffic among several 
instances of the application running on back-end nodes. The 
application instances can run on several nodes and several 
instances can run on the same host. The API completely 
hides the distribution and the active and standby instances
of the stack. Thus, an application can be configured to run

Fig. 10. A front-end/back-end mode allows the HP OpenCall
SS7 platform to support distributed applications.

either on a simplex node (no high availability), on a duplex
node (active/standby), or on a back-end node without
modifying anything in the code. All connections between the
various systems are made over a dual LAN (possibly FDDI for
high-end systems) to eliminate any single point of failure.

This flexibility allows users to use their own high availability 
and distribution schemes. 

Stack Implementation

As mentioned earlier, MTP 3, SCCP, and TCAP are
implemented in a single user space process. The protocol
implementation started in 1988. The SS7 stack uses
object-oriented technology and a message passing bus for interobject
communication. Fig. 11 shows the stack implementation.

Each layer has a software bus instantiated. Entities can
dynamically register on the bus, specifying what kind of
messages they are interested in. Entities are object classes that
model elements of the protocol. Typical objects are MTP 3
links, SCCP remote subsystem numbers, and so on. Each
object instance is associated with a unique key (object
identifier), usually extracted out of the protocol information,
which allows very efficient dispatching by the bus. Entities
can send messages on the bus to be multicast to the target
entities, calling one of their base class methods.
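
The register-and-dispatch idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class names (Bus, Entity, Mtp3Link), the message type string, and the use of a (message type, key) pair as the dispatch index are all hypothetical and are not taken from the product.

```python
from collections import defaultdict


class Bus:
    """Toy per-layer software bus that dispatches by (message type, key)."""

    def __init__(self):
        # (message type, object key) -> list of registered entities
        self._subscribers = defaultdict(list)

    def register(self, entity, msg_type, key):
        """An entity declares interest in one message type for one key."""
        self._subscribers[(msg_type, key)].append(entity)

    def send(self, msg_type, key, payload):
        """Multicast to every entity registered for (msg_type, key)."""
        targets = self._subscribers.get((msg_type, key), [])
        for entity in targets:
            entity.on_message(msg_type, payload)
        return len(targets)


class Entity:
    """Hypothetical base class for objects that model protocol elements."""

    def __init__(self, bus, key):
        self.bus = bus
        self.key = key
        self.received = []

    def on_message(self, msg_type, payload):
        self.received.append((msg_type, payload))


class Mtp3Link(Entity):
    """Example entity: registers itself on the bus under its link id."""

    def __init__(self, bus, link_id):
        super().__init__(bus, link_id)
        bus.register(self, "LINK_STATUS", link_id)


bus = Bus()
link_a = Mtp3Link(bus, link_id=1)
link_b = Mtp3Link(bus, link_id=2)
delivered = bus.send("LINK_STATUS", 2, "in service")  # reaches link_b only
```

Because the lookup is a single dictionary access on the key, the sender needs no reference to the receiving object, which is one way to obtain the low coupling mentioned in the text.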



Fig. 11. SS7 stack implementation in the HP OpenCall SS7
platform.

This method has proved to be very efficient in terms of
encapsulation and coupling between objects. (Note that
the MTP 2 layer does not implement the MTP 2 protocol but
rather provides the interface with the MTP 2 implemented in
the signaling interface units.)

Message Set Customization 

To extend the capabilities of the SS7 platform, it is
necessary to provide more built-in protocols as they are adopted.
These new protocols are built on top of TCAP and are used
in the intelligent network or in the mobile network.
Although standardized, the flavors of these protocols vary
from network to network and from vendor to vendor. The
same applies to ISUP, for which there is about one version
per country.

To deal with the diversity of message formats at the product
level without having to build a special version for every new
flavor of the protocols, we have developed a message set
customization technology to automate the customization
of a protocol. This consists of a message set compiler in
conjunction with a generic run-time engine to process the
encoded messages (see Fig. 12).

The format of the messages used by the protocol is defined
in Abstract Syntax Notation 1 (ASN.1) with some specific
annotations. ASN.1 is the OSI standard for defining data
structures and is used by most of the protocols that we
implement. However, the message set compiler technology
is not restricted to ASN.1. A protocol such as ISUP, which
is not defined in ASN.1, can also be accommodated.

The compiler generates a metadata file, which contains the
definition of the messages. The run-time engine loads these
metadata files and can immediately encode and decode new
definitions of messages without impacting the API or
requiring recompilation or relinking. The benefit of this technology
is that it can adapt the product to the user's exact
specifications at the latest possible stage, without impacting the core
product.
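
The principle of a generic engine driven by loadable metadata can be illustrated with a small sketch. Everything here is an assumption made for illustration: the JSON metadata layout, the message names (LocationUpdate, Ping), and the fixed-width binary encoding via Python's struct module merely stand in for the product's ASN.1-based metadata files and encoding rules, which the article does not detail.

```python
import json
import struct

# Hypothetical "compiled" metadata: message name -> [field name, struct code].
METADATA_V1 = json.dumps({
    "LocationUpdate": [["subscriber_id", "I"], ["cell", "H"]],
})


class Engine:
    """Generic codec: knows no message layout until metadata is loaded."""

    def __init__(self):
        self.messages = {}

    def load(self, metadata_text):
        """Add the message definitions found in one metadata document."""
        for name, fields in json.loads(metadata_text).items():
            layout = ">" + "".join(code for _, code in fields)  # big-endian
            names = [field for field, _ in fields]
            self.messages[name] = (names, struct.Struct(layout))

    def encode(self, name, values):
        names, packer = self.messages[name]
        return packer.pack(*(values[n] for n in names))

    def decode(self, name, data):
        names, packer = self.messages[name]
        return dict(zip(names, packer.unpack(data)))


engine = Engine()
engine.load(METADATA_V1)
wire = engine.encode("LocationUpdate", {"subscriber_id": 7, "cell": 3})

# A new message definition arrives as data only; the engine code is unchanged.
engine.load(json.dumps({"Ping": [["sequence", "B"]]}))
ping = engine.decode("Ping", engine.encode("Ping", {"sequence": 9}))
```

The point of the design is that supporting a new protocol flavor means shipping a new metadata file rather than a new build of the engine.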
Performance 

The HP OpenCall SS7 platform can handle more than 2100
SS7 transactions per second, a transaction being defined as
one message into a dummy application and one message out
from the application. These figures were measured on an HP
9000 Model K420 host computer. The implementation is
CPU-bound, so its capacity automatically increases
whenever more powerful system hardware becomes available.

The constraints set up for the development were rather
stringent and very similar to in-kernel development. For
example, no file system access is allowed except at startup,
and all calls must be asynchronous. We are closely watching
the I/O traffic to avoid falling into I/O bottlenecks. One of
the reasons why we do not get I/O bottlenecks is that we
group as many SS7 messages as possible together before
doing any transfer. For SCSI, this is mandatory because
SCSI is architected for rather large data transfers whereas
SS7 handles very small messages (around 100 bytes). Therefore,
the signaling interface unit and the SCSI driver purposely
introduce some latency to transfer larger data blocks.
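
The trade of a little latency for larger transfers can be sketched as a batcher that flushes on a size limit or an age limit. This is a minimal illustration only: the class name, the two thresholds, and the millisecond time base are assumptions, and the article does not say how the signaling interface unit or the SCSI driver actually decide when to transfer.

```python
class Batcher:
    """Groups small messages into one larger transfer (illustrative only)."""

    def __init__(self, max_bytes, max_delay_ms):
        self.max_bytes = max_bytes        # flush once this much is queued
        self.max_delay_ms = max_delay_ms  # or once the oldest message is this old
        self.pending = []
        self.pending_bytes = 0
        self.oldest_ms = None

    def add(self, message, now_ms):
        """Queue a message; return a batch if a flush is due, else None."""
        if not self.pending:
            self.oldest_ms = now_ms
        self.pending.append(message)
        self.pending_bytes += len(message)
        return self.poll(now_ms)

    def poll(self, now_ms):
        """Return the pending batch if the size or age limit is reached."""
        if not self.pending:
            return None
        too_big = self.pending_bytes >= self.max_bytes
        too_old = now_ms - self.oldest_ms >= self.max_delay_ms
        if too_big or too_old:
            batch, self.pending = self.pending, []
            self.pending_bytes = 0
            self.oldest_ms = None
            return batch
        return None


batcher = Batcher(max_bytes=250, max_delay_ms=10)
first = batcher.add(b"a" * 100, now_ms=0)    # queued, nothing to send yet
second = batcher.add(b"b" * 100, now_ms=1)   # still below the size limit
third = batcher.add(b"c" * 100, now_ms=2)    # 300 bytes queued: flush all three
```

The age limit bounds the latency that batching adds when traffic is light, while the size limit keeps each transfer large when traffic is heavy.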

HP-UX 9.* and 10.0 for HP 9000 Series 700 and 800 computers are X/Open Company UNIX 93
branded products.

UNIX is a registered trademark in the United States and other countries, licensed exclusively
through X/Open Company Limited.

X/Open is a registered trademark and the X device is a trademark of X/Open Company Limited
in the UK and other countries.

High Availability in the HP OpenCall
SS7 Platform



Fault tolerance in computer systems is discussed and high availability 
is defined. The theory and operation of the active/standby HP OpenCall 
solution are presented. Switchover decision-making power is vested in 
a fault tolerance controller process on each machine. 

by Brian C. Wyld and Jean-Pierre Allegre 



Our lives are increasingly dependent on our technology. Some
things are trivial, like being able to watch our favorite TV
program, while some are much more important, like medical
equipment. When you start to look at the difference technology
makes to our lives, starting from the convenience, you begin
to appreciate the problems we would have if it were to break
down.

Some breakdowns are merely irritating. Not being able to
phone to arrange a night out isn't going to kill anyone. But
when you can't phone for help in an emergency, then a lack
of the service we all take for granted is a lot more serious.
Being able to ensure that the technology we use every day
keeps working is therefore a part of the functionality, as
much as providing the service in the first place.

Although it would be nice if things never broke down, we all
know that this is impossible. Everything has a flaw; entropy
always gets us in the end. The unsinkable ship sinks, the
uninterruptible power supply gets cut, the unbreakable plate
proves to be only too breakable (usually with the assistance
of a child).

We all depend in one way or another on the continued
functioning of our technology, so we all have some dependence
on the tolerance of that technology to the faults that will
inevitably strike it. When a computer malfunction can disrupt
the lives of millions of people, fault tolerance is not just
nice, but absolutely necessary.

Computer Fault Tolerance 

Computer fault tolerance covers a range of functionality.
The important aspects to consider are the speed of recovery
in the presence of a fault, the perception of the users in case
of a fault, and the consequences to the application of a fault.
With these in mind, the following degrees of fault tolerance
are often defined.

Reliable Service. The system is built to be as reliable as
possible. No effort to provide tolerance of faults is made, but
every part of the system is engineered to be as reliable as
possible to avoid the possibility of a fault. This includes both
the hardware, with either overengineered components or
rigorous testing, and the software, with design methods that
attempt to ensure bug-free code and user interfaces designed
to prevent operator mistakes. Reliability rates of 99.99% can
be achieved.

High Availability. The emphasis is on making the service
available to the user as much of the time as possible. In the
event of a fault, the user may notice some inconsistency or
interruption of service, but will always be able to reconnect
to the service and use it again either immediately or within a
sharply bounded period of time. A reliability rate of 99.999%
is the target.

Continuous Availability. This is the pinnacle of fault tolerance.
When a fault occurs, the user notices no interruption or
inconsistency in the service. The service is always there, never
goes away, and never exhibits any behavior that leads the
user to think that a fault might have happened. Needless to
say, this level is both difficult and expensive to obtain. The
reliability rate is 100%.

In general, a fault tolerant system will be highly available in
most parts, with touches of continuous availability.
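
To give a feel for what these rates mean in practice, the sketch below converts an availability percentage into expected downtime per year. It reads the figures quoted above as 99.99%, 99.999%, and 100%; the function name and the 365-day year are assumptions made for the illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (leap years ignored)


def downtime_minutes_per_year(availability_percent):
    """Expected unavailable minutes per year at a given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for label, rate in [("reliable service", 99.99),
                    ("high availability", 99.999),
                    ("continuous availability", 100.0)]:
    print(f"{label}: {downtime_minutes_per_year(rate):.1f} min/year")
```

On this reading, 99.99% allows a little under an hour of downtime per year and 99.999% a little over five minutes, which shows how much each additional "nine" tightens the budget.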

Achieving Fault Tolerance 

Fault tolerance is usually achieved by using redundant
components. Based on the assumption that any component, no
matter how reliable, will eventually either fail or require
planned maintenance, every component in the system that
is vital is duplicated. This redundancy is designed so that the
component can be removed with a minimal amount of
disruption to the operation of the service. For instance, using
mirrored disk drives allows a disk to fail or be disconnected
without altering the availability of the data stored on the disk.

The way the redundancy is designed varies according to the
paradigm used in the system. There are several ways to
build in the redundancy, depending on the use made of the
redundant components and how consistency is maintained
between them.

Multiple Active. The redundant components may in fact be
used simply to provide the service in a load-sharing way.
In this case, data and functionality are provided identically
by all the components. The load from the users of the
service is spread across the components so that each handles
a part of the load. In the event of a component failure, the
load is taken up by the others, and their load increases
correspondingly.

Active/Standby. In this paradigm, there exist one or more
active components, which provide the service to the users,
and in parallel, one or more standby components. These
provide a shadow to the actives and in the event of an active
failing, change state to become active and take over the job.
Variations on the theme involve one-to-one active/standby
pairs, N actives to one standby, and N actives to M standbys.

Coupling 

How the service is affected by the failure is also important.
The redundant component, whether it was also providing
the service or not, can be loosely or tightly coupled to the
other components.

When loosely coupled, the redundant component only has a
view of the state of the active at certain times, at the end of
a transaction, for instance. This has the effect that a failure
of the component while processing a user's request will lose
the context. Any finished work will be unaffected, but work
in progress is lost and the user must restart. However, the
effort required to keep the standby coupled is low.

A tightly coupled component will remain in step with the
component processing the request, so that it can take over
seamlessly in the event of the fault. The workload is much
higher because many more messages must be exchanged,
and the speed of the operation may be slower to ensure that
at every stage the standby is in step with the active.

Of course, many shades and granularities of loose versus
tight coupling are possible in a single system.

Traditionally, hardware fault tolerant systems have been
exponents of the tight coupling paradigm: two or more
processors executing exactly the same instructions in
synchronization, with the outputs selected on either an
active/standby basis or by a voting system. Software systems
have leaned more towards the loose coupling method, at
various levels of granularity. For instance, there are database
transactional paradigms in which user database accesses are
bundled into transactions, and only once a transaction is
committed does that transaction become certain and
unaffected in the event of a failure.

Predicting and Measuring Fault Tolerance 

Various statistical methods exist to measure the fault
tolerance of a system in a quantitative manner. These usually use
the standard measures of system failure such as MTBF
(mean time between failures) and MTTR (mean time to
repair), and are combined to give a forecast of application
downtime. However, the values of downtime produced by
such methods can be inaccurate, and sometimes bear little
resemblance to the true values.
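
One common way such measures are combined is the textbook steady-state formula, availability = MTBF / (MTBF + MTTR). The article does not state which formulas it has in mind, so the sketch below is an assumption, and the numbers are purely illustrative. Note that the calculation for a redundant pair explicitly assumes independent failures.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of a single component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def duplicated_availability(single):
    """Availability of a redundant pair, ASSUMING independent failures.

    Under that assumption the pair is down only when both halves
    happen to be down at the same time.
    """
    return 1 - (1 - single) ** 2


# Illustrative numbers only, not measurements from any real system.
single = availability(mtbf_hours=1000, mttr_hours=10)
pair = duplicated_availability(single)
print(f"single component: {single:.6f}, duplicated pair: {pair:.6f}")
```

The attractive figure for the pair rests entirely on the independence assumption, so a forecast built this way is only as good as that assumption.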

The main reason for this is that failures may not be isolated
and uncorrelated, and this is very difficult to take into
account. Simply predicting from the MTBF and MTTR that
the chance of a single failure bringing down the entire
system is very small is not realistic when the single failure will
often provoke subsequent related failures, often in the part
of the system trying to recover from the fault. Most fault
tolerant systems and related statistical analysis are based
on an assumption of a single failure, and systems are built
to avoid a single point of failure. In practice, and in the often
inconvenient real world, failures can happen together, and
can cause other failures in turn.

The other assumption that causes trouble is that of silent
failure. It is often assumed that when a component fails, it
does so silently, that is, in a failure mode that doesn't affect
other components. For instance, having a dual LAN between
several computers to avoid the LAN's being a single point of
failure doesn't help when a crashed computer decides to
send out nonsense on all of its LAN interfaces, effectively
preventing use of any LAN.

Downtime Causes 

Things that cause downtime on systems can be grouped into
several main categories. First is the obvious computer
hardware failure. This may be caused by a component's lifetime
being exceeded, by a faulty component, or by an
out-of-specification component. Often, hardware failures in one
component can cause other components to fail. Many
computer systems are not constructed to allow a single
component to fail or to be replaced without affecting other
components. For instance, a failed disk drive on a SCSI bus will
force the entire system to be halted for its replacement even
though only one component has failed.

This often implies that avoiding the single point of failure
means adding more hardware than might seem reasonable
(a second SCSI controller card and chain, for instance, so
that the backup disk drive can be on a separate SCSI bus).
Reliable hardware, coupled with a system built to allow
hot-swappable components, can do a lot to eliminate this source
of downtime.

The second obvious cause of downtime is software failures.
No software will ever be entirely bug-free. Even formal
methods, quality reviews, and all the rest of the trappings of
computer science cannot keep those elusive problems from
slipping in. The main problem with bugs is that the ones that
escape to released systems are usually in code that has not
been well-tested. This may often be the code designed to
recover from failures, since this is difficult to cover fully
with testing. Recovery may also often involve a higher load
than normal as standby processes become active, load files,
and so on, and this can often expose lurking bugs. The net
effect is that when your application crashes and burns, your
standby application, ready and waiting to continue the
service, promptly follows it down in flames instead.

Another not so obvious but very real source of downtime
is operator intervention. Although in theory, operators of a
system will always follow procedures and always read the
manual, in practice they are prone to typing rm * (deleting all
of the files on the disk), and pulling out the wrong power
plug. Even when the mistake is not so obvious, mistakes
such as badly configured systems, too much load on a
critical system, or enabling unneeded tracing or statistics can
bring the system down.

No amount of clever fault tolerant algorithms or
mathematically proven designs will help here. However, a carefully
planned system configuration, with working defaults and
a user interface that is designed to help the user make the
correct choices by presenting the correct information in
a timely and obvious fashion, can go a long way towards
avoiding these sorts of problems. This is, unfortunately, an
often neglected part of the system design, with attendant
problems. When building a system that end users will use as
a platform, with programming interfaces, the importance of
providing usable interfaces becomes even greater.

Finally, there are the disasters. The absolutely unforeseeable,
one-in-a-million, couldn't happen in a lifetime chances. Who
could predict that the earthmoving equipment would knock
down the pylon supplying the mains electricity to the
computer center, which would fall onto the telecommunications
lines, blowing every piece of data communications
equipment, after which it would careen into the hole where the
water main was being repaired, breaking it and flooding the
basement where the batteries for the UPS are installed,
completely destroying them? "Impossible," you might say,
but it has happened. In such cases, geographically separated
sites can prove to be the only possible solution if a 100%
available system is really required. This does rule out certain
forms of fault tolerance (any form of dual-ported hardware,
for instance, or lockstep processors) but is possible with
software fault tolerance techniques.

Telecommunications Fault Tolerance 

The requirements on a fault tolerant system vary with 
the application. In telecommunications, we see different 
requirements being demanded depending on the element 
being addressed. On the billing services side, the require- 
ments are biased towards ensuring that no data loss occurs. 
Limited application downtime is acceptable but any billing 
data should be safe. This sort of system is similar to the 
requirements of any database-oriented application, and tech- 
nologies such as mirrored disks and reliable systems are 
usually sufficient. 

For operations services, which provide the management of 
the network, certain essential administration and manage- 
ment functions should always be available so that control 
over the network is always maintained. In the service provi- 
sion environment for which the HP OpenCall SS7 platform is 
designed, the essential requirements are to avoid disruption 
to the network, to have a continuously available service, and 
to avoid disruption to calls in progress in the event of a
fault. 

To avoid disruption to the network, the SS7 protocol provi- 
sion has to avoid ever being seen as down by the network. 
This essentially means that in the event of a fault, normal 
protocol processing must resume within six seconds. Any 
longer than this and the SS7 network will reconfigure to 
avoid the failed node. This reconfiguration process can be 
very load-intensive and can cause overloads in the network. 
This effect is to be avoided at all costs. 

To provide continuous availability requires that the applica- 
tion that takes over service processing from a failed applica- 
tion must be at all times in the same state with respect to its 
processing. This is also required to ensure that current calls 
in progress are not disrupted. The state and data associated 
with each call must be replicated so that the user sees no 
interruption or anomaly in the service. 



HP OpenCall Solution 

To fulfill all of these requirements for a telecommunications 
services platform is not an easy task. We chose to implement
a simple active/standby high availability fault tolerance model 
that is capable of providing most customer needs. 

To achieve high availability, we need to replicate all the hard- 
ware components (see Fig. 1). We have defined a platform 
as being a set of two computers (usually HP 9000 Series 800) 
interconnected by a dual LAN, equipped with independent 
mirrored disks and sharing a set of SS7 signaling interface 
units via redundant SCSI chains (see the article on page 58 
for more details). 

The highest constraint on the system is to be able to perform
a switchover in less than six seconds. The SS7 protocol MTP 
Level 2 running on the signaling interface unit can tolerate a 
6-s interval without traffic. If this limit is exceeded, the SS7 
network detects this node as down, triggering a lot of alarms, 
and we've missed our high availability goal. 

In a nutshell, the high availability mechanism works as 
follows. One system is the active system, handling the SS7 
traffic and controlling all the signaling interface units. In 
case of a failure on the active side, the standby system gets 
control of the signaling interface units and becomes the new 
active. During the transition, the signaling interface units 
start buffering the data. When the buffers are full (which 
happens rapidly), the signaling interface units start sending 
MTP Level 2 messages to the other end to signal a transient 
outage. If this outage lasts more than 6 s, the SS7 network 
detects this node as down, so it is critical that in less than 6 s, 
a new active system take over. The failure detection time is 
the most crucial one. We need to detect failures in less than 
four seconds to be able to perform a safe switchover. 

Fig. 1. The high availability solution in the HP OpenCall SS7
platform calls for replication of all hardware components.

Fig. 2. A process starting up with an active process running is in
the down state. When it reaches the hot standby state, it is ready
to become the active if the active goes down.

The architecture goal is to protect the platform from failing 
in case of a single failure. In general, the dual-failure case 
leads to a service loss, even though some cases may be 
recovered. 

Software Model 

The HP OpenCall SS7 platform is based on an active/standby
model, with a pair of UNIX* processes providing a
service. Only one of the pair is actually doing the job at any
time (the active) while its peer process (the standby) is idle,
waiting to take over in the event of a failure.

The service provided by the process may be the SS7 protocol
stack, centralized event management, or a
telecommunications application. The standby process is not completely
idle. It must be kept up to date with the state of the active
process to be able to resume processing from the same
point if a failure occurs. When in this state, it is hot standby
(see Fig. 2).

Consider a process starting up when an active process is
already running. A process is initially down, that is, not
running. When it is started, it performs whatever startup process
is required (booting), and then is cold standby. In this state,
it is correctly configured and could perform the service if
required, but all current clients would see their states being
lost.

This would be enough to give a highly available service, but
would not fulfill the requirement to avoid disruption to
current clients. The process must therefore now synchronize
itself with the active process, during which time it is
synchronizing. Once it is completely up to date, it is hot standby.
In this state, current clients should see no disruption if the
active process fails.

Obviously, if no active is running, the process goes to active
from cold standby, since there are no current clients.
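
The startup sequence just described can be written down as a small state machine. This is a sketch under assumptions: treating "booting" as a state of its own, the names State, ALLOWED, transition, and startup_path, and the rule that a failure may return a process to down from any state are choices made for the illustration rather than details given in the article.

```python
from enum import Enum


class State(Enum):
    DOWN = "down"
    BOOTING = "booting"
    COLD_STANDBY = "cold standby"
    SYNCHRONIZING = "synchronizing"
    HOT_STANDBY = "hot standby"
    ACTIVE = "active"


# Forward transitions described in the text. A failure can return a
# process to DOWN from any state, which transition() handles separately.
ALLOWED = {
    State.DOWN: {State.BOOTING},
    State.BOOTING: {State.COLD_STANDBY},
    State.COLD_STANDBY: {State.SYNCHRONIZING, State.ACTIVE},
    State.SYNCHRONIZING: {State.HOT_STANDBY},
    State.HOT_STANDBY: {State.ACTIVE},
    State.ACTIVE: set(),
}


def transition(current, target):
    """Return target if the move is legal, otherwise raise ValueError."""
    if target is State.DOWN or target in ALLOWED[current]:
        return target
    raise ValueError(f"illegal transition: {current.value} -> {target.value}")


def startup_path(active_peer_running):
    """States a starting process passes through until it settles."""
    path = [State.DOWN, State.BOOTING, State.COLD_STANDBY]
    if active_peer_running:
        path += [State.SYNCHRONIZING, State.HOT_STANDBY]
    else:
        path.append(State.ACTIVE)
    # Check every step against the transition table.
    for before, after in zip(path, path[1:]):
        transition(before, after)
    return path


with_peer = [s.value for s in startup_path(active_peer_running=True)]
alone = [s.value for s in startup_path(active_peer_running=False)]
```

Keeping the legal moves in one table makes it easy to reject a jump such as down to active, which would skip configuration and synchronization.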

Once there exists this pair of processes, one active provid- 
ing the service and one standby providing the backup, the 
system is ready to deal with a failure. When this occurs, the 
failure must first be detected, and then a decision on the
action to be taken must be made. 



Fault Tolerance Controller 

The HP OpenCall SS7 platform centralizes the decision-making
process into a single controller process per
machine, which is responsible for knowing the states of all
processes controlled by it on its machine. It has a peer
controller (usually on the peer machine) which controls all the
peer processes. These two fault tolerance controllers make
all decisions with regard to which process of the pair is
active. Each high availability process has a connection to
both the fault tolerance controller and to its peer.

The fault tolerance controllers also have a connection
between them (see Fig. 3). The A channel allows the two fault
tolerance controllers to exchange state information on the
two system processes and to build the global state of the
platform. The A channel also conveys heartbeat messages.
The B channels allow the fault tolerance controllers to pass
commands governing the state of the processes, and to
receive their state in return. A process cannot change state
to become active without receiving a command from the
fault tolerance controller. Because the fault tolerance
controller has information on all high availability processes and
on the state of the LAN, CPU, and all peer processes, it can
make a much better decision than any individual process.
Finally, the C channels are replication channels. They allow
peer processes to replicate their state using an
application-dependent protocol.

Failure Detection 

For the success of any fault tolerant system, failures of any
component must be detected quickly and reliably. This is
one of the most difficult areas of the system. The HP OpenCall
SS7 platform uses several mechanisms to detect various
kinds of faults.

To detect a failure of one of the high availability processes,
a heartbeat mechanism is used between the fault tolerance
controller and the high availability process via the B channel.
UNIX signals are also used to detect a failure of a child
process (the fault tolerance controller is the parent of all high
availability processes), but they provide information only

Fig. 3. The fault tolerance controllers make all decisions with
regard to which process is active. Each high availability process
has a connection to both the fault tolerance controller and to its
peer. A heartbeat mechanism helps detect failures.

when a process exits. To detect more subtle process faults
such as deadlocks or infinite loops, the heartbeat mechanism 
is required, but it has the drawback of mandating that every 
high availability process be able to respond to heartbeat 
messages in a timely fashion, usually around every 500 ms. 
This is not so critical in our environment since we expect 
our processes to behave in a quasi-real-time manner, but it 
rules out using any potentially blocking calls. 

To detect system hang, we use a specific mechanism imple- 
mented in the fault tolerance controllers. Each fault toler- 
ance controller, running in HP-UX* real-time priority mode 
(rtprio), uses a real-time timer that ticks every 2 s (typically).
At every tick, the fault tolerance controller checks the dif- 
ference between the last tick and the current time. If this 
difference exceeds a certain tolerance, this means that the 
system has been hung for a while, since the fault tolerance 
controller is configured to have the highest priority in the 
system and should therefore never be prevented from re- 
ceiving a real-time timer. Upon occurrence of such an event, 
the fault tolerance controller exits after killing the other high
availability child processes. As strange as it may sound, this
is the safest thing to do. If the system has been hung for a
while, the peer fault tolerance controller should have also
detected a loss of heartbeat and should have decided to go
active. If we were to let the fault tolerance controller of the 
hung system keep running once it wakes up, we would have 
two active systems, with all the possible nasty effects this 
entails. 
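
As an illustration only, the tick check described above might be
sketched in C as follows. The 2 s period comes from the text; the
tolerance value, the millisecond timestamps, and the function name
are assumptions made for this sketch and are not taken from the
HP OpenCall implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TICK_PERIOD_MS    2000  /* nominal real-time timer period (from the text) */
#define HANG_TOLERANCE_MS  500  /* assumed slack before declaring a hang */

/* Returns true when the gap between two consecutive timer ticks is so
 * much longer than the nominal period that the system must have been
 * hung. Timestamps are assumed to come from a monotonic clock. */
static bool hang_detected(int64_t last_tick_ms, int64_t now_ms)
{
    assert(now_ms >= last_tick_ms);
    return (now_ms - last_tick_ms) > (TICK_PERIOD_MS + HANG_TOLERANCE_MS);
}
```

In such a sketch the caller would kill its high availability child
processes and exit as soon as the function returns true, rather than
try to resume service.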

The two fault tolerance controllers also exchange heartbeat messages
(along with more detailed state information). Should a heartbeat
fail, the fault tolerance controller of the active side will assume
that the peer is down (for example, because of a dual-LAN failure, a
peer system panic, or a peer fault tolerance controller failure) and
will do nothing (except log an event to warn the operator). If the
fault tolerance controller of the standby side detects this event, it
will assume that something is wrong on the active side. It will
decide to go active and will send an activate command to all of its
high availability processes. If the old active is indeed dead, this
is a wise decision and preserves the service. In the case of a
dual-LAN failure (this is a dual-failure case that we are not
supposed to guard against), we may have a split-brain syndrome.
In our case, we use the signaling interface unit (see article,
page 58) as a tiebreaker. If the active SS7 stack loses control of
the signaling interface unit (because the peer stack has taken
control of it), it will assume that the other system is alive and
will exit, asking the fault tolerance controller not to respawn it.
Operator intervention is necessary to clear the fault and bring the
platform back into its duplex state.
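
The decision rules in the preceding paragraph can be summarized in a
small sketch. All type and function names below are hypothetical and
chosen only for illustration; the real controllers also exchange
state information that is not modeled here.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { ROLE_ACTIVE, ROLE_STANDBY } ftc_role;
typedef enum { FTC_LOG_ONLY, FTC_GO_ACTIVE } ftc_action;
typedef enum { STACK_CONTINUE, STACK_EXIT_NO_RESPAWN } stack_action;

/* Reaction of a fault tolerance controller to a lost peer heartbeat:
 * the active side only logs an event, the standby side goes active. */
static ftc_action on_peer_heartbeat_lost(ftc_role role)
{
    assert(role == ROLE_ACTIVE || role == ROLE_STANDBY);
    return (role == ROLE_STANDBY) ? FTC_GO_ACTIVE : FTC_LOG_ONLY;
}

/* Tiebreaker: an active stack that no longer controls the signaling
 * interface unit concludes that its peer is alive and exits for good. */
static stack_action on_siu_check(bool stack_is_active, bool controls_siu)
{
    if (stack_is_active && !controls_siu)
        return STACK_EXIT_NO_RESPAWN;
    return STACK_CONTINUE;
}
```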

Dual-LAN Support 

At the time the HP OpenCall SS7 platform project started, no
standard mechanism existed to handle dual LANs, nor did we want to
implement a kernel-level dual-LAN mechanism. We therefore selected a
user space mechanism provided by a library that hides the dual LAN
and provides reliable message-based communication over two TCP
connections (Fig. 4).

Fig. 4. The dual-LAN mechanism is a user space mechanism provided by
a library that hides the dual LAN and provides reliable message-based
communication over two TCP connections.

A message library provides the dual-LAN capability and message
boundary preservation. The message library opens two TCP connections,
one on each LAN. Only one LAN is used at a time (there is no attempt
to perform load sharing). On each of the TCP connections, we maintain
a small traffic of keep-alive messages (one every 500 ms), which
contain just a sequence number. On each TCP connection, the message
library monitors the difference between the sequence numbers on each
LAN. If the difference exceeds a given threshold, one LAN is assumed
to be either broken or overloaded, in which case the message library
decides to switch and resume traffic on the other LAN. No heartbeat
timer is used. Only differences in round-trip time can trigger a LAN
switch. The benefit of this solution is that it is independent of the
speed of the remote process or remote machine and scales without
tuning from low-speed to high-speed LANs. It also allows very fast
LAN switching time.
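
A minimal sketch of the sequence-number comparison, under stated
assumptions: the threshold of 3 keep-alives, the structure layout,
and the function names are invented for this example, and the real
message library is certainly more elaborate.

```c
#include <assert.h>
#include <stdint.h>

#define SWITCH_THRESHOLD 3   /* assumed: tolerated lag, in keep-alives */

typedef struct {
    uint32_t last_seq[2];    /* last keep-alive sequence seen on each LAN */
    int      current;        /* index of the LAN carrying traffic */
} lan_monitor;

static void lan_monitor_init(lan_monitor *m)
{
    m->last_seq[0] = 0;
    m->last_seq[1] = 0;
    m->current = 0;
}

/* Records a keep-alive received on one LAN and returns the LAN that
 * should carry traffic. No timer is involved: only the lag of the
 * current LAN behind the other LAN can trigger a switch. */
static int on_keepalive(lan_monitor *m, int lan, uint32_t seq)
{
    assert(lan == 0 || lan == 1);
    m->last_seq[lan] = seq;

    int other = 1 - m->current;
    /* Unsigned subtraction followed by a signed cast tolerates
     * sequence-number wraparound on common platforms. */
    int32_t lag = (int32_t)(m->last_seq[other] - m->last_seq[m->current]);
    if (lag > SWITCH_THRESHOLD)
        m->current = other;
    return m->current;
}
```

Because the comparison is relative, a slow peer delays the
keep-alives on both LANs equally and does not cause a switch, which
is the property the paragraph above points out.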

A drawback is the sensitivity of this mechanism to a loaded LAN,
which is perceived as a broken LAN. For this reason, we recommend
that an extra LAN dedicated to application bulk traffic be added to
the system. Another problem is that when switching LANs, we have no
way of retrieving unacknowledged TCP messages in retransmit on the
new LAN, so we end up losing messages upon a LAN switch. Some parts
of the platform guard themselves against this by implementing a
lightweight retransmission protocol.

Access to High Availability Services 

An important objective of the HP OpenCall SS7 platform is to shield
the application writer from the underlying high availability
mechanisms. We came up with the scheme illustrated in Fig. 5 to
access the high availability processes.

Let's take the example of the SS7 process. The SS7 library maintains
two message library connections (four TCP connections because of the
dual LAN): one with the active instance and one with the standby.
The uft library (user fault tolerance library) transparently manages
the two connections and always routes the traffic to the active
instance of the stack. Upon switchover, the SS7 process informs its
client library via a surviving message library connection that it is
now the new active and that traffic should be routed to it. From an
API point of view, the two connections (four sockets) are hidden from
the user by exporting an fd_set (as used by select()) instead of a
file descriptor. The application main loop should be built along the
following lines:

    while (1) {
        API_pre_select(&rm, &wm, &em, &timeout);
        // application possibly adds
        // its own fd in the mask here
        // FD_SET(myFd, &rm);
        select(FD_SETSIZE, &rm, &wm, &em, &timeout);
        API_post_select(&rm, &wm, &em, &timeout);
        // application possibly checks
        // its own fd here
        // if (FD_ISSET(myFd, &rm)) {}
    }

The application main loop must continuously call the preselect
function of the API to get the accurate value of fd_set (sockets can
be closed and reopened transparently in case of failure), then call
select(), possibly after having set some application-specific file
descriptor in the mask, then call the API postselect function with
the mask returned by select(). In the postselect phase, the library
handles all necessary protocol procedures to maintain an accurate
view of the state of the high availability process, along with user
data transfer.

State Management 

One of the key aspects of the active/standby paradigm is the
definition of the process state and how precisely it must be
replicated. The framework described above does not enforce how state
management should be performed. It provides a replication channel
between the two peer processes and information about the processes,
but no specific semantics for state. Different schemes are used
depending on the nature of the state to be replicated. A key element
to consider when designing such a system is the state information
that must be preserved upon switchover and its update frequency.
For instance, blindly replicating all state information in a system
targeted at 4000 messages per second would be highly inefficient,
because the replication load would exceed the actual processing.



Fig. 5. The method of accessing the high availability processes is
designed to shield the application writer from the underlying high
availability mechanisms.



For these reasons, we have not set up a generic state replication
mechanism, but rather build ad hoc mechanisms depending on the nature
of the state. For example, on the SS7 stack, the MTP 3 protocol has
no state associated with data transfer (such as window value, pending
timer, connection state), but has a lot of network management
information that must not be lost in case of switchover.

The policy has been to intercept MTP 3 management messages coming
from the network or from the OA&M (operation, administration, and
maintenance) API and send them to the standby via the replication
channel or the API. The standby stack processes the MTP 3 management
messages in the same way as the active, so the computed states are
identical.

TCAP transactions are not replicated because of the high
rates of creation and deletion and the amount of state infor- 
mation associated with the component handling. The effect 
is that opened TCAP transactions are lost in case of switch- 
over. Work is progressing on a scheme that preserves trans- 
actions by letting the user of the transaction decide when the 
transaction becomes important and should be replicated. 

An alternative to replicating messages is to replicate the state
after it has been computed from the message. The usual algorithm for
this scheme is to do the computation on the active side, use an ad
hoc protocol to marshal the new state to the standby, and let the
standby update itself. If the standby fails to replicate the state,
it decides to go to the down state and will be restarted.
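
The following sketch shows what such a marshal-and-apply scheme could
look like. It is only an illustration: the 7-byte wire format, the
link table, and every name in it are assumptions made for this
example and do not describe the actual HP OpenCall protocol.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define UPDATE_WIRE_SIZE 7   /* 4-byte sequence, 2-byte link id, 1-byte state */
#define MAX_LINKS 16

typedef struct {
    uint32_t seq;
    uint16_t link_id;
    uint8_t  link_state;
} state_update;

typedef struct {
    uint32_t next_seq;               /* next update the standby expects */
    uint8_t  link_state[MAX_LINKS];
    bool     down;                   /* set when replication has failed */
} standby_replica;

/* Active side: encode a computed state change in network byte order. */
static void marshal_update(const state_update *u, uint8_t out[UPDATE_WIRE_SIZE])
{
    out[0] = (uint8_t)(u->seq >> 24);
    out[1] = (uint8_t)(u->seq >> 16);
    out[2] = (uint8_t)(u->seq >> 8);
    out[3] = (uint8_t)(u->seq);
    out[4] = (uint8_t)(u->link_id >> 8);
    out[5] = (uint8_t)(u->link_id);
    out[6] = u->link_state;
}

/* Standby side: apply one update. Any malformed or out-of-sequence
 * update marks the replica as down so that it can be restarted. */
static bool apply_update(standby_replica *r, const uint8_t *buf, size_t len)
{
    assert(r != NULL && buf != NULL);
    if (len != UPDATE_WIRE_SIZE) {
        r->down = true;
        return false;
    }
    uint32_t seq = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
                   ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
    uint16_t link = (uint16_t)(((unsigned)buf[4] << 8) | buf[5]);
    if (seq != r->next_seq || link >= MAX_LINKS) {
        r->down = true;
        return false;
    }
    r->link_state[link] = buf[6];
    r->next_seq++;
    return true;
}
```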

Another important design aspect for the high availability system is
the synchronization phase. A starting cold standby system has to
perform the cold standby to hot standby transition by getting all the
state information from the active and rebuilding it locally. This
operation should disturb the active as little as possible, but care
must be taken that the algorithm converges. If the state of the
active changes faster than the standby can absorb, there is a risk
that the standby may never catch up. This is usually addressed by
assuming that the standby has much more CPU available than the
active, and if necessary, slowing down the active. In the case of SS7
signaling, the amount of state information is rather small and the
information is stable, so we use a relatively simple algorithm. The
synchronization is performed by having the standby fork a helper
process that, via the SS7 OA&M API, dumps the content of the state
information to disk and then replays it to the standby via the API.
To check that the state is correct and that the standby can go to hot
standby, the standby stack initiates an audit phase that checks that
the two configurations are identical. If this is not the case, the
process is resumed. Otherwise, the state of the standby goes to hot
standby. This is a simple implementation, but has proven to be
sufficient for SS7.

Technical Challenges 

Developing a high availability platform on the HP-UX operating
system has been a great challenge, but we've obtained a very stable
and operational product, deployed in hundreds of sites worldwide.

One of the technical challenges was that HP-UX is not a real-time
operating system and we need determinism to handle the high
availability aspects, especially with the very small reaction time
(<6 s) that we are allowed. We've addressed this by forbidding some
operations (like file system access and potentially blocking system
calls) in time-critical processes such as the SS7 stack, by slicing
every long-lived operation into small operations, and by trying to
stay below the saturation limit (where response time starts to
increase rapidly). For example, we recommend keeping overall CPU
utilization below 85% and staying far below the LAN maximum
bandwidth.

Another challenge was time synchronization between the various
hosts. We do not need time synchronization between the hosts for
proper operation, but some of our customers request it. We've used
the NTP package (Network Time Protocol), which has proved to work
reasonably well except when the clock drift between the hosts was too
large to be compensated smoothly and NTP decided to suddenly jump the
system clock to catch up. This caused problems for synchronization of
events, and also fired our failure detection mechanisms. We resolved
these problems using external clocks and configuring NTP in a
controlled manner to avoid such time jumps.

HP-UX 9.* and 10.0 for HP 9000 Series 700 and 800 computers are
X/Open Company UNIX 93 branded products.

UNIX is a registered trademark in the United States and other
countries, licensed exclusively through X/Open Company Limited.

X/Open is a registered trademark and the X device is a trademark of
X/Open Company Limited in the UK and other countries.
A Benchtop Inductively Coupled
Plasma Mass Spectrometer

The HP 4500 is the first benchtop ICP-MS. It has a new type of optics 
system that results in a very low random background and high sensitivity, 
making analysis down to the subnanogram-per-liter (parts-per-trillion) 
level feasible. It can be equipped with HP's ShieldTorch system, which 
reduces interference from polyatomic ions. 



by Yoko Kishi 



Inductively coupled plasma mass spectrometry (ICP-MS) is 
an analytical technique that performs elemental analysis 
with excellent sensitivity and high sample throughput. The 
ICP-MS instrument employs a plasma (ICP) as the ionization 
source and a mass spectrometer (MS) analyzer to detect the 
ions produced. It can simultaneously measure most elements 
in the periodic table and determine analyte concentration 
down to the subnanogram-per-liter or part-per-trillion (ppt)
level. It can perform qualitative, semiquantitative, and quan- 
titative analysis and compute isotopic ratios. 

The schematic diagram of an ICP-MS instrument is shown in Fig. 1.
Basically, liquid samples are introduced by a peristaltic pump to the
nebulizer where a sample aerosol is formed. A double-pass spray
chamber ensures that a consistent aerosol is introduced to the
plasma. Argon (Ar) gas is introduced through a series of concentric
quartz tubes, known as the ICP torch. The torch is located in the
center of an RF coil, through which 27.12-MHz RF energy is passed.
The intense RF field causes collisions between the Ar atoms,
generating a high-energy plasma. The sample aerosol is
instantaneously decomposed in the plasma (plasma temperature is on
the order of 6,000 to 10,000K) to form analyte atoms, which are
simultaneously ionized. The ions produced are extracted from the
plasma into the mass spectrometer region, which is held at high
vacuum (typically 10⁻⁶ Torr, 10⁻⁴ Pa). The vacuum is maintained by
differential pumping.

The analyte ions are extracted through a pair of orifices,
approximately 1 mm in diameter, known as the sampling cone and the
skimmer cone. The analyte ions are then focused by a series of ion
lenses into a quadrupole mass analyzer, which separates the ions
based on their mass/charge ratio (m/z). The term quadrupole is used
because the mass analyzer is essentially four parallel molybdenum
rods to which a combination of RF and dc voltages is applied. The
combination of these voltages allows the analyzer to transmit only
ions of a specific mass/charge ratio. Finally, the ions are measured
using an electron multiplier, and data at all masses is collected by
a counter. The mass spectrum generated is extremely simple. Each
elemental isotope appears at a different mass (e.g., ²⁷Al would
appear at 27 amu) with a peak intensity directly proportional to the
initial concentration of that isotope. The system also provides
isotopic ratio information.

New Benchtop ICP-MS 

The HP 4500 is the world's first benchtop ICP-MS (see Fig. 2). The
reduction in instrument size is dramatic: the size of the previous
model is 1550 by 900 by 1450 mm, while that of the HP 4500 is 1100 by
600 by 582 mm. Previous generations of ICP-MS instruments had
requirements (space, utilities, and environment) that dictated that a
special room be dedicated for the instrument. Installing an ICP-MS
could be particularly difficult, since major construction changes
were often required.

Fig. 1. HP 4500 ICP-MS schematic diagram. Labeled parts include the
ICP torch, the detector, and the plasma gas and auxiliary gas inlets.

Fig. 2. HP 4500 ICP-MS.

The HP 4500 is smaller and lighter so that it can be installed on an
existing bench. The layout of the instrument is designed to make user
interaction with the sample introduction system, the interfaces, and
the ion lenses routine. All parts can be accessed from the front and
connected or disconnected easily. These and other new features and
technology introduced and used by the HP 4500 help to make ICP-MS a
more routine and therefore a more accessible technique.

Ion Lens System

The configuration of the ion lens system is one of the key design
issues because it directly affects the ion transmission efficiency of
an ICP-MS system. Various ion lens configurations were produced and
evaluated to determine the optimum configuration and operating
conditions for the HP 4500. Ion trajectories through each ion lens
system were predicted mathematically.

The HP 4500 is equipped with a new type of optics system, as shown
in Fig. 3a. The omega lens consists of a pair of crescent-shaped
lenses that resemble the Greek letter Ω. The optics system contains
two omega lenses, the omega + and omega - lenses, which bend the ion
beam, allowing the quadrupole and detector to be mounted off-axis.
This prevents photons from reaching the detector (which would
increase random background noise), and also focuses the ions very
efficiently. The result is a very low random background and high
sensitivity, making ultratrace analysis down to the
subnanogram-per-liter level feasible. In contrast, other ICP-MS
systems employ a photon stop lens system as shown in Fig. 3b.¹ Ions
are defocused after extraction into the main vacuum chamber and then
refocused, while photons are blocked by the photon stop. With this
design, some ions inevitably collide with the photon stop and are
lost, so overall transmission is reduced.

An example of ion trajectory mapping for the optics system of
Fig. 3a is shown in Fig. 4. In this example, the initial ion energy
was estimated at III eV and the space-charge effect² was ignored. The
broad trace in the center shows the ion trajectories for the lens
voltage settings shown. Starting from the left, the lenses and their
voltages are: skimmer cone (no voltage), extraction lens 1 (-KiOV),
extraction lens 2 (-70V), einzel lens 1 (-KHIV), einzel lens 2 (SV),
einzel lens 3 (-KHl\), omega bias lens (-:i3V), omega + lens (IV),
omega - lens (-"A), quadrupole focus and plate bias lenses (-10V).
The einzel lenses are a traditional electrostatic lens system in
which the voltage on the center lens is different from the voltage on
the other two lenses.





Fig. 3. Ion lens system. (a) HP 4500 omega lens system. (b) Photon
stop system.

Fig. 4. Example of ion trajectory mapping.

Dual-Mode Detection System

The dynamic range of the ICP-MS system is extended from six to eight
orders of magnitude in the HP 4500 by a newly developed dual-mode
detection system. The electron multiplier used in the dual-mode
system is a discrete dynode type operated in both pulse count and
analog modes.

The block diagram of the dual-mode system is shown in Fig. 5. When
an ion enters the electron multiplier, it hits the first dynode and a
shower of electrons is generated. These electrons hit the next
dynode, generating more electrons. Finally, the pulse generated is
detected by the collector. This small signal is amplified and a
measurable pulse signal is obtained. At this point, the output signal
from the amplifier contains both electrical noise and the pulse
signal. After the amplifier, the electrical noise is eliminated by a
discriminator circuit and pulse signals higher than the discriminator
voltage are converted to an ideal pulse shape. This pulse is measured
as one count.

At very high analyte concentrations (>1 mg/l in the sample
solution), detector saturation occurs, so the dual-mode system is
automatically switched to analog mode and the ion current is
measured. The ion current is converted to a frequency by a
voltage-to-frequency converter and measured as counts per second.

The dual-mode detector system extends the maximum working range of
the instrument up to approximately 100 mg/l. The appropriate mode for
each isotope is selected automatically by the HP ChemStation
operating software, and dual-mode data is acquired simultaneously,
which is another first for ICP-MS. The great benefit is that samples
containing a range of analytes at different concentration levels can
be analyzed in a single analysis.
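
The automatic mode selection can be pictured with a short sketch.
This is a hypothetical illustration, not the HP ChemStation
algorithm: the saturation limit of 1.0e6 counts per second and the
pulse/analog cross-calibration factor are assumptions introduced here
so that both modes report on a common scale.

```c
#include <assert.h>

typedef enum { MODE_PULSE, MODE_ANALOG } detect_mode;

#define PULSE_LIMIT_CPS 1.0e6   /* assumed pulse-count saturation limit */

/* Pulse counting is used until the count rate approaches saturation. */
static detect_mode select_mode(double pulse_cps)
{
    return (pulse_cps > PULSE_LIMIT_CPS) ? MODE_ANALOG : MODE_PULSE;
}

/* Returns the signal for one isotope on the pulse-count scale.
 * pa_factor is an assumed calibration factor that converts the
 * voltage-to-frequency output of the analog channel to equivalent
 * pulse counts per second. */
static double reported_cps(double pulse_cps, double analog_cps, double pa_factor)
{
    assert(pa_factor > 0.0);
    if (select_mode(pulse_cps) == MODE_ANALOG)
        return analog_cps * pa_factor;
    return pulse_cps;
}
```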

Without dual-mode operation, dilution, preconcentration, or other
complicated sample preparation steps would be involved. It is
inevitable that as the process for sample preparation gets more
complex, an increasing number of errors and contamination will occur.
Contamination during sample preparation is always of concern when
analyzing elements at trace levels.

The ShieldTorch System 

Although the ICP-MS generates essentially monatomic, positively
charged analyte ions, there are still several polyatomic ions such as
ArO, ArC, and ArH, which arise mainly from the combination of the
argon gas used to generate the plasma with oxygen, carbon, and
hydrogen from the air and the samples. The main interferences are
shown in Table I.

Table I
Typical Interferences in ICP-MS

Analyte   m/z   Interferent
K         39    ³⁸Ar¹H
Ca        40    ⁴⁰Ar
Ca        44    ¹²C¹⁶O₂
Cr        52    ⁴⁰Ar¹²C
Fe        56    ⁴⁰Ar¹⁶O


The HP 4500 can be equipped with HP's proprietary technology called
the ShieldTorch system, which reduces interference from polyatomic
ions.³ The electrical model of the plasma and the interface region is
shown in Fig. 6. When the plasma is coupled with the RF coil
inductively, the plasma has only a slight dc potential. However,
there is capacitive coupling between the plasma and the RF coil,
which creates a positive plasma potential oscillating at the radio
frequency of the plasma source.

Fig. 5. Block diagram of the HP 4500 dual-mode detection system.

Within the plasma, positive ions and electrons exist, since the
plasma temperature is high (6,000 to 10,000K). The numbers of
positive ions and electrons are essentially equal, so the plasma is
electrically neutral. Since the sampling cone is cooled by water, the
plasma temperature decreases dramatically when the plasma comes close
to the cone. Positive ions and electrons do not exist any more and
the neutral Ar atom becomes dominant, creating a "sheath" between the
interface and the plasma. Since the plasma potential is grounded to
the interface and the vacuum chamber through the sheath, it acts as a
condenser and the charge buildup around the sampling cone results in
the formation of a discharge inside the first vacuum stage, commonly
called the secondary discharge. The secondary discharge ionizes
molecules such as ArO, ArH, and ArAr inside the first vacuum stage,
giving rise to interferences with analyte ions at the same nominal
mass.



Fig. 6. Electrical model of the
plasma and the interface region.



When the ShieldTorch system is used, a shield plate is inserted
between the torch and the RF coil, eliminating the capacitive
coupling between the plasma and the RF coil so that the plasma
potential is effectively reduced to zero. As a result, there is no
longer a secondary discharge and polyatomic ions are not ionized
behind the sampling cone. To reduce the polyatomic ions even further,
the plasma temperature is reduced, since these polyatomic ions are
also generated in the plasma itself. By lowering the plasma
temperature, the ShieldTorch system reduces these interferences
dramatically, resulting in improved detection limits down to ng/l or
ppt levels for elements such as Fe, Ca, and K, typically three orders
of magnitude better than without the ShieldTorch system. Typical
spectra with and without the ShieldTorch system are shown in Fig. 7.

HP ChemStation Operating Software

The HP ChemStation operating software is easy to learn and use. All
instrument parameters are controlled via the HP ChemStation, unlike
traditional ICP-MS systems, which were completely manual before the
introduction of the HP 4500. An example screen from the HP
ChemStation is shown in Fig. 8, which is the instrument control
screen. A single click of the mouse starts the entire system, while
system status is displayed in real time.

Fig. 7. Typical spectra of deionized water (a) with and (b) without
the ShieldTorch system. Labeled peaks include ArH, CO2, ArO, ArC,
and Ar2; the horizontal axis is mass (amu).

The HP ChemStation automates day-to-day operation by employing a
suite of autotuning routines. Autotuning automatically optimizes the
sensitivity, background level, and mass resolution and performs mass
calibration. In the tuning screen, the user can select the tuning
actions to be performed and the target values for sensitivity, oxide
and doubly charged ions, and background. Three masses (typically one
each at low, middle, and high mass) are simultaneously adjusted using
a proprietary algorithm based on the simplex method.⁴ Each ion lens
voltage is changed to increase the signal of the element that has the
weakest relative response (ratio of actual signal to target value)
among the three masses until all the signals satisfy the target
values. This allows less experienced operators to operate the
instrument to its full potential.
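
The selection rule at the heart of this loop, picking the mass with
the weakest relative response, can be sketched as follows. The sketch
covers only that selection and the stopping test; it is not the
proprietary simplex-based algorithm, and the function names are
invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns the index of the mass whose relative response
 * (actual signal divided by target value) is the lowest. */
static int weakest_mass(const double actual[], const double target[], int n)
{
    assert(n > 0);
    int worst = 0;
    double worst_ratio = actual[0] / target[0];
    for (int i = 1; i < n; i++) {
        double ratio = actual[i] / target[i];
        if (ratio < worst_ratio) {
            worst_ratio = ratio;
            worst = i;
        }
    }
    return worst;
}

/* Tuning can stop once every signal reaches its target value. */
static bool all_targets_met(const double actual[], const double target[], int n)
{
    for (int i = 0; i < n; i++) {
        if (actual[i] < target[i])
            return false;
    }
    return true;
}
```

An autotuning loop built on these helpers would adjust one lens
voltage, remeasure the three masses, and repeat until
all_targets_met() returns true.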

Applications 

The HP 4500 ICP-MS offers high-throughput multielement analysis with
ng/l (ppt) or better detection limits, very small sample volume
requirements, robustness, and ease of use. Therefore, the application
areas for the HP 4500 are very wide, from the semiconductor industry
in which the concentration of analytes is extremely low, to the
environmental, geological, and clinical fields in which high-matrix
or "dirty" samples are analyzed.

Semiconductor Sample Analysis. The trend towards pattern
miniaturization and ultra large-scale integration (ULSI) in
semiconductor devices requires the lowering of the level of metallic
impurities present. In recognition of the need for higher-purity
chemicals to meet the needs of submicrometer device production, the
SEMI Process Chemicals Committee has proposed several grades for each
chemical.




Fig. 8. Example screen from the
HP 4500 ChemStation.



Hydrogen peroxide, H₂O₂, is widely used to remove metallic, organic,
and particulate contaminants from wafer surfaces during the
semiconductor manufacturing process. The H₂O₂ must be of extremely
high purity to avoid contamination of the wafer surface by the
cleaning solution itself. The specification for H₂O₂ (30-32%) in the
SEMI Tier C Guidelines (the quality needed to produce ICs whose
critical dimensions lie in the range of 0.09 to 0.2 µm or greater)
stipulates that the maximum concentration of impurities should be
100 ng/l (ppt) for a suite of 18 metals. Table II shows the results
of a quantitative purity analysis of H₂O₂ (30%).

Until now, recovery data presented to the Process Chemicals
Committee by member companies has involved the use of ICP-MS followed
by graphite furnace atomic absorption spectroscopy (GFAAS) for Ca and
Fe. The HP 4500 with the ShieldTorch system can determine even Fe, K,
and Ca at low ppt levels not normally possible by quadrupole ICP-MS
because of interferences from polyatomic ions and isobars such as
ArO, ArH, and Ar.

Table II also shows the recovery results at the 50 ng/l (ppt) level.
The recoveries of all of the elements were well within SEMI Tier C
Guidelines, which stipulate that recovery data must be obtained
showing 75 to 125% recoveries for all metals.
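
For readers unfamiliar with spike recovery, the sketch below uses the
conventional definition: the increase in measured concentration after
spiking, divided by the amount added. The article does not spell out
its calculation, so the formula and the function names are
assumptions; only the 75 to 125% window comes from the text.

```c
#include <assert.h>
#include <stdbool.h>

/* Spike recovery in percent, with all concentrations in ng/l. */
static double recovery_percent(double unspiked, double spiked, double spike_added)
{
    assert(spike_added > 0.0);
    return (spiked - unspiked) / spike_added * 100.0;
}

/* Acceptance window quoted above for the SEMI Tier C Guidelines. */
static bool recovery_acceptable(double percent)
{
    return percent >= 75.0 && percent <= 125.0;
}
```

As a purely illustrative case, a sample that reads 4.7 ng/l before
spiking and 55.7 ng/l after a 50 ng/l spike has a recovery of about
102%, which falls inside the window.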

Environmental Sample Analysis. Concerns regarding safe levels of
contaminants in the environment, particularly heavy metals, continue
to grow. The requirement for analysis of more elements at
ever-decreasing concentrations is exposing the limitations of
currently used analytical techniques. ICP-MS is the only technique
that offers the improvements in sensitivity that will be demanded in
the near future. ICP-MS is approved for several environmental
analytical methods including those developed by the U.S.
Environmental Protection Agency (EPA).
Table II
Quantitative Results for Hydrogen Peroxide (30%)

           Concentration   Detection Limit   Recovery
Element    (ng/l)          (ng/l)            (%)
B          188             4                 96
Na         4.7             0.5               102
Mg         8               2                 97
Al         9                                 102
K          not detected    0.02              101
Ca         31              1                 109
Ti         11              2                 98
Cr         2               1                 101
Mn         1.3             0.1               102
Fe         51              1                 106
Ni         6.5             0.6               98
Cu         2.6             0.4               102
Zn         8               1                 101
As         12              0.7               116
Sn         1.4             0.5               102
Sb         3.1             0.5               104
Au         7               2                 100
Pb         3.4             0.3               98

Fig. 9 shows the qualitative spectrum of river water standard
reference material (SLRS-3). A large number of elements, ranging from
lithium (Li) at low mass to uranium (U) at high mass, can be clearly
observed, even though the total analysis time was only 100 seconds.
Table III shows HP 4500 ICP-MS quantitative results, which are in
excellent agreement with the certified values. The dual-mode
detection system allows the user to quantitate the analytes from a
few tens of ng/l (ppt) to the mg/l (ppm) level.



Table III
Quantitative Results for River Water

           Certified          Measured
           Concentration      Concentration
Element    (µg/l)             (µg/l, N = 3)
Be         0.005 ± 0.001      0.0051 ± 0.0004
Na         2300 ± 200         2260 ± 30
Mg         1600 ± 200         1450 ± 10
Al         31 ± 3             32.3 ± 0.5
K          700 ± 100          700 ± 30
Ca         6000 ± 400         5720 ± 10
V          0.3 ± 0.02         0.303 ± 0.004
Cr         0.3 ± 0.04         0.303 ± 0.003
Mn         3.9 ± 0.3          3.70 ± 0.07
Fe         100 ± 2            98.7 ± 0.5
Co         0.027 ± 0.003      0.0288 ± 0.0002
Ni         0.83 ± 0.08        0.769 ± 0.003
Cu         1.35 ± 0.07        1.39 ± 0.02
Zn         1.04 ± 0.09        1.01 ± 0.00
As         0.72 ± 0.05        0.697 ± 0.007
Sr         28.1*              30.1 ± 0.2
Mo         0.19 ± 0.01        0.193 ± 0.005
Cd         0.013 ± 0.002      0.0125 ± 0.0002
Sb         0.12 ± 0.01        0.127 ± 0.001
Ba         13.4 ± 0.6         13.3 ± 0.1
Pb         0.068 ± 0.007      0.060 ± 0.003
U          0.045*             0.0413 ± 0.0008

* Not certified, information value only.
Fig. 9. Qualitative spectrum of
river water reference material.
Clinical Sample Analysis. The determination of toxic elements such
as mercury (Hg), lead (Pb), and cadmium (Cd) in humans has been a
critical issue in the field of clinical chemistry from the toxicology
viewpoint. In addition, since recent biomedical research has shown
that some elements at trace levels have specific functions in the
biochemistry of living organisms, the determination of trace element
concentrations in human beings has also become a major issue in the
field of nutritional study. As a result, the analysis of toxic
elements and also many trace elements in biological samples is
required. The analyte concentration range is large, ranging from the
trace levels normally found in the body to the high levels resulting
from industrial exposure. Since medical treatment regimes for
hospital patients depend on the analytical results reported, the
analysis of biomedical samples is critical. Therefore, the need for
fast and reliable analytical methods and instrumentation is
paramount.

Table IV shows the HP 4500 ICP-MS quantitative results for 
human hair standard reference material (NIES No. 5) which 
was decomposed by a microwave sample preparation system. 
The concentrations of 11 elements analyzed were in good
agreement for all the elements that had certified values 
(there is no certified value for As). 

Table IV
Quantitative Results for Human Hair

           Certified        Measured         Detection
           Concentration    Concentration    Limit
Element    (µg/g)           (µg/g)           (µg/g)
Al         240*             220 ± 6          0.003
Cr         1.4 ± 0.2        1.72 ± 0.07      0.004
Mn         5.2 ± 0.3        5.47 ± 0.13      0.001
Fe         225 ± 9          219 ± 5          0.9
Ni         1.8 ± 0.1        1.87 ± 0.06      0.004
Cu         16.3 ± 1.2       16.7 ± 0.6       0.002
Zn         169 ± 10         171 ± 4          0.004
As         **               0.18 ± 0.02      0.02
Se         1.4*             2.4 ± 0.3        0.004
Cd         0.2 ± 0.03       0.21 ± 0.03      0.0002
Hg         4.4 ± 0.4        4.52 ± 0.15      0.003
Pb         6.0*             5.98 ± 0.11      0.0007

* Not certified, information value only.
** Not certified.

Solid Sample Analysis. Solutions and liquids are the normal
sample types measured by ICP-MS. Solid samples are normally
digested using mineral acids and analyzed as solutions.
However, solid samples such as glass can be analyzed directly
using the laser ablation system. The schematic diagram of
this system is shown in Fig. 10. A sample is placed in the
sample cell and ablated by the beam from a Nd:YAG laser
operating at 266 nm. The fine aerosol generated is carried
directly to the plasma by Ar carrier gas. Fig. 11 shows
qualitative data for glass standard reference material (NIST 614).
Group 1 and 2 elements, transition metals, rare earth elements,
and actinides can be clearly seen from a two-minute
analysis, even though the concentration of most elements
was at the mg/kg (ppm) level or lower in the glass.

Fig. 10. Schematic diagram of laser ablation system (labeled
parts include the power supply and sample cell).

In addition to the bulk analysis capability shown, this tech- 
nique also has the capability to analyze sample features and 
inclusions as small as 10 µm in diameter.

Speciation Analysis. Organotin compounds have been widely 
used for a variety of commercial applications. Trialkyltin 
compounds have been used for antifouling paints for ships 
and fish traps. Dialkyltin has been used for polymerization 
catalysts. Currently, there is growing concern about their 
effects on the environment. Methods to determine the 
species of tin (Sn) and the total amount of Sn present
are required, since the toxicity of organotin compounds
varies widely with the number and types of organic groups
attached to the Sn atom. The combination of ICP-MS and
chromatography has the ability to perform speciation analysis
with high selectivity and sensitivity. Fig. 12 shows a
chromatogram of six organotin compounds obtained by the 
HP 4500 ICP-MS combined with the HP 1050 liquid chroma- 
tograph. Each organotin compound was separated clearly 
within a total run time of 20 minutes. Detection limits 
obtained were 24 to 51 pg as Sn. 

Summary 

The HP 4500 ICP-MS offers high sensitivity, low background, 
a wide dynamic range, and the reduction of polyatomic ions, 
even though its benchtop size is only one fifth the size of the 
previous model. It is designed for routine use, easy operation, 
and easy maintenance. With these features, the HP 4500 is 
ideal for a wide range of applications in the semiconductor 
industry, environmental studies, laboratory research, plant 
quality control, and other areas.

Audit History and Time-Slice Archiving in an Object DBMS for
Laboratory Databases

Development of an object database management system allows rapid, 
convenient access to large historical data archives generated from 
complex databases. 

by Timothy P. Loomis 



The requirements for laboratory databases include many
of the same features specified for other types of databases,
including enforcement of a rigorous transaction model, support
for concurrent users, distributed recovery capabilities,
performance, and security. However, the requirements differ
from most databases by the emphasis on saving a complete
and recoverable record of historical data for some types of
data. This requirement comes from the regulatory overseeing
authority of the pharmaceutical industry by organizations
such as the U.S. Government's Food and Drug Administration
or Environmental Protection Agency, and often, the
legal importance of the data (patent law). Some examples
of historical data in a chemical laboratory include previous
values of test results, designated reviewers and approvers
of data, methods of analysis, and ingredients used to produce
a product. It is necessary to be able to determine when
this data changed, who changed it, and why a change was
necessary.

Most laboratory database systems have tried to deal with 
historical data by adding complex logic to the application 
code to record and retrieve historical data in special tables 
that are added to traditional relational database schemas.
While this technique works for simple schemas with a few 
objects that need to be monitored for change, its complexity 
overwhelms development, testing, and support efforts for 
more realistic databases. In short, it does not scale to the 
complex databases needed for the future. 

Keeping track of historical data became a critical design
factor when the HP ChemStudy product was being developed
in the laboratory information management system program
in HP's Chemical Analysis Solutions Division. HP ChemStudy
controls all the information used in multiyear projects
that determine the expiration dates on drugs. The database
is complex with 128 types of application objects interconnected
through numerous relationships. It is necessary to
be able to reproduce the contents of objects and the state of
their relationships at any time in the past to satisfy regulatory
requirements.

Our solution to the historical data challenges of laboratory 
databases has been to develop a database management sys- 
tem (DBMS) that provides built-in support for historical data 
for any object and for groups of objects that are connected 
through relationships. The simplicity and extensibility of this 



Glossary 

Commit. The database operation that makes changes by a user
permanent in the database and visible to other users.

Component Object. An object that is contained logically by another
(composite) object.

Composite Object. An object that logically contains other (component)
objects.

Exclusive Lock. A database mark placed on an object on behalf of
a user to prohibit another user from obtaining a lock or modifying the
object.

Foreign Key. A way of identifying data from a row in one table that is
duplicated in a row in another table to logically relate the two rows.

Pessimistic Concurrency. The model of database design and programming
that obtains exclusive locks on data to be updated to ensure that
the commit operation will not identify conflicts with other users and fail.
The other end-member model, optimistic concurrency, avoids obtaining
locks but risks a commit failure.

Rollback. The database operation that discards changes by a user,
returning the database to its state before the transaction began.

Save Point. During the process of modifying objects before commit, a
user can mark a save point and later roll back to this state rather than to
the beginning of the transaction, discarding later changes to objects.
Save points are removed at commit or rollback.

Schema. A description of the tables, the data within tables, and the
logical relationships among data for a relational database. This is often
extended to any description of database data and relationships in
general.

Logical Transaction. A collection of database modifications that
should be implemented completely or not at all.

Two-Phase Commit Protocol. A method of commit that tries to verify
that all databases participating in a logical transaction can implement
their part of the transaction before implementing the changes in any
database. This is used to avoid only partially implementing a logical
transaction in a distributed database system.

system are possible because we have developed a pure object
DBMS (ODBMS) in which relationships are themselves
objects. Although the ODBMS provides many advantages for
applications development, this article will concentrate on
the issue of historical data.

The ODBMS is implemented in C++ on the HP-UX* operating
system and Windows NT.

System Overview 

Before considering the details of how historical data is managed
in the database, we need an overview of the distributed
ODBMS to understand how an object is created and stored.
While this modular system can be configured in many ways,
Fig. 1 presents an example configuration that is used in the
HP ChemStudy product.

In Fig. 1, a client is a process that incorporates C++ class
code that defines application objects. While the object
created by the application can contain any data needed in the
application, the object is managed (locked, updated, saved)
through the services of the generic object manager module.
The object manager also controls logical transactions (commit
and rollback) and provides save points and other DBMS
functions. At the object manager level, all objects are treated
alike and no changes are required to support any new application
object types. The client may have a user interface
(shown as a graphical user interface (GUI) in Fig. 1) or it
could be an application server with code to support its own
clients.

The object manager can connect to one or more object
servers that control a database. The ability to connect to
multiple object servers makes the system a distributed
DBMS and necessitates a two-phase commit protocol to
ensure that a transaction affecting multiple databases works
correctly. The distributed capabilities of the ODBMS are
employed for archiving operations (described below) and
for integrating data from multiple active databases.
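
The two-phase commit idea can be sketched in a few lines. This is only an illustration, written in Python for brevity although the ODBMS itself is implemented in C++. The `ObjectServer` class and its `prepare`, `commit`, and `rollback` methods are hypothetical names chosen for the sketch, not the product's interface.

```python
class ObjectServer:
    """Hypothetical stand-in for one database taking part in a transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.committed = {}   # durable object data
        self.pending = None   # changes staged during phase 1

    def prepare(self, changes):
        # Phase 1: stage the changes and vote on whether they can be applied.
        if not self.can_commit:
            return False
        self.pending = dict(changes)
        return True

    def commit(self):
        # Phase 2: make the staged changes permanent.
        self.committed.update(self.pending or {})
        self.pending = None

    def rollback(self):
        # Discard anything staged in phase 1.
        self.pending = None


def two_phase_commit(work):
    """work maps each server to the changes it should apply.

    Returns True if every server committed, False if all were rolled back.
    """
    prepared = []
    for server, changes in work.items():
        if server.prepare(changes):
            prepared.append(server)
        else:
            # One participant cannot commit, so nobody commits.
            for s in prepared:
                s.rollback()
            return False
    for server in prepared:
        server.commit()
    return True
```

In this sketch an archiving operation would pass both the active and the archive server in `work`, so that the data either moves completely or not at all.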



Currently, we provide two types of object servers which
differ only in the driver code module that stores object data.
From the point of view of a client process, there is no difference
in the way an object is treated. The Oracle object server
stores an object in Oracle tables while the file object server
stores the object in one or more redundant file structures as
object data. While the file version is faster than Oracle for
read and write by a factor of 10 to 100, some customers prefer
the Oracle version because it conforms to their corporate
information systems requirements. The file version also
stores data more compactly and is ideal for embedded databases
that are not visible to users and for databases in
which the speed of storing and retrieving data is critical.
Because the object data stored by either type of server is
binary, multimedia data or a binary file can be stored by
breaking the data into objects. Objects are also useful for
processing a large binary data file in clients that do not have
enough memory to hold all the data at once.

Laboratory databases become so huge that it is necessary to
remove old data periodically from the active database and
place it in some type of archive for long-term storage. Most
systems have used a special storage medium for archived
data and require that the data be dearchived back to the
active database for review. Instead, we use the distributed
capabilities of the ODBMS to transfer data from the active
database to an archive database as a simple distributed database
transaction. The archive database can then be taken
offline without limiting current operations. Fig. 1 shows an
Oracle server being used for the active database and a file
version being used for an archive database.

The object database provides access for C++ object applications
but lacks facilities for ad hoc queries and reports that
can be customized by a customer. To accommodate ad hoc
queries and report writers, a collection of mapped tables
can be created that provide a more traditional relational
database schema of the application data. Each type of C++
object can be mapped to its own table in the map schema
when it is inserted or updated, but it is always read by the
application from the object database. In practice, only some
data in selected objects is mapped. This object-relational
DBMS combination has proven to be very successful at providing
the customer with reporting flexibility, while preserving
the speed and simplicity of a pure object system for the
application code.

An example of mapping is shown in Fig. 2. The example
considers three objects of three different types: Dept, Emp,
and EmpList (relationship). A client connected to the object
server transports binary objects to and from the server
cache. Except for objects newly created by a client, all objects
in the cache have persistent counterparts in the object
database and are read into the cache from this database. All
objects are inserted or updated in the object database during
the commit operation. At the option of the application designer,
selected data from an object can also be mapped to
the map database as shown for the Dept and Emp objects. The
EmpList relationship object is not mapped in this example.
Relationships are usually defined using foreign keys in
relational schemas.
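
As a rough illustration of this mapping, the sketch below copies selected attributes of Dept and Emp objects into relational tables and leaves EmpList unmapped. It is written in Python using the standard `sqlite3` module purely for demonstration. The table names, column names, and the `MAP_SCHEMA` dictionary are assumptions made for the example and are not taken from the product.

```python
import sqlite3

# Hypothetical mapping: object type -> (table name, attributes copied to it).
# EmpList has no entry, so relationship objects are not mapped.
MAP_SCHEMA = {
    "Dept": ("dept", ["id", "name"]),
    "Emp": ("emp", ["id", "name", "dept_id"]),
}


def create_map_tables(conn):
    """Create the relational tables that receive mapped data."""
    conn.execute("CREATE TABLE dept (id TEXT PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE emp (id TEXT PRIMARY KEY, name TEXT, dept_id TEXT)"
    )


def map_object(conn, obj_type, obj):
    """Copy the selected attributes of obj to its mapped table, if it has one.

    Called when an object is inserted or updated. Returns False for
    object types that are not mapped.
    """
    if obj_type not in MAP_SCHEMA:
        return False
    table, columns = MAP_SCHEMA[obj_type]
    placeholders = ", ".join("?" for _ in columns)
    sql = (
        f"INSERT OR REPLACE INTO {table} ({', '.join(columns)}) "
        f"VALUES ({placeholders})"
    )
    conn.execute(sql, [obj[c] for c in columns])
    return True
```

In the mapped tables the Dept and Emp rows are related through the `dept_id` foreign key column, so a report writer can join them with ordinary SQL while the application continues to read the objects themselves from the object database.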

We can see from this overview that an object is a bundle of 
data that can exist simultaneously as a C++ object in multiple 
clients, as an object in the cache in the object server, as 
object data in a database, and as mapped data in a relational 
table. Managing the relationships among these multiple 
representations of an object requires adherence to a rigor- 
ous transaction model. Many of the features necessary to 
deal with historical versions of an object are extensions to
controls that already exist for object data. 



Auditing Laboratory Data 

There is more to a database data item than a value that can 
be retrieved. For example, that value was created by some- 
one or some calculation, it may have been converted from 
a string representation with a specific precision, it was 
created at some date and time, it may have some application- 
specified limits that cannot be exceeded, and so on. More- 
over, the current value may have replaced a previous value, 
requiring a justifying comment, and it may be necessary to 
retrieve all earlier values of this data item. It has long been 
a requirement for laboratory databases to maintain this type 
of information associated with a laboratory measurement 
and to record a history of changes to the measurement. 
We generally refer to the process of maintaining a record 
of a value and its associated information through time as 
auditing or maintaining an audit trail. In the context of 
an object database, auditing means keeping a record of the 
history of an object and objects associated with it through
relationships. 

Auditing database data has generally meant keeping a sepa- 
rate record or audit log of selected changes made to the 
database. For example, Oracle provides the capability to 
audit user, action, and date for access to selected object 
types but requires a user to write triggers to record changes 
to data values. While this straightforward mechanism does 
accomplish the task, its use for large and complex databases 
rapidly generates huge volumes of data that require sophisti- 
cated searching to identify particular changes of interest. 
A simple audit log of database changes is practical only if one
hopes that it will never be needed! Audit logs are routinely
needed in the pharmaceutical industry and will soon be a
common requirement for other industries subject to regulatory
oversight, such as software development processes
subject to ISO validation. Searching through a huge audit log
is not a reasonable way to answer an auditor's questions
about the history of an object that may contain, or be associated
with, hundreds of component objects.

The alternative to an external audit log is a DBMS that has
an intrinsic method for auditing an object and its relationships.
In the next section we discuss general methods developed
to audit selected classes of composite objects stored in
an ODBMS so that the audit data can be retrieved easily.

The subject of temporal databases has received considerable
research attention directed mainly toward extending
the relational model and providing time-based query 
methods. 1 - The implementation presented here differs 
from these models principally by: 

• Using an object model

• Using relationship objects together with lock-and-update
propagation to synchronize the time history of related
objects, rather than attempting to deal with the more
general problem of "joining" any set of objects

• Being a working implementation for audit-trail applications
that deals with load errors and numerous practical programming
problems.

Commercial extended relational databases such as Illustra' 1 
are beginning to provide some time-based capabilities for 
specialized data. 

Example Schema 

Auditing an object is complicated by references to other
objects. Consider Fig. 3, which shows an abbreviated class
schema for a division of a company containing departments,
department offices, and employees within departments.
Relationship classes (objects) derived from the class list are
shown explicitly in this diagram because they are important
in auditing. (For clarity all lists are shown as separate
classes rather than as inherited base classes.) A reference to
another object is shown explicitly as an arrow in this diagram
because we will be concerned with the details of propagation
of information between objects. A line terminated
with a dot (as in IDEF1X or OMT modeling) 1 indicates that
references to multiple objects can be stored. An A in the
lower-right corner of a class indicates that objects in that
class are audited.

Composite Objects. Audited relationships should be used to
contain the components of a composite object. A composite
object is one that can be considered to logically contain
other component objects in the application. More precisely
for our purposes, a composite object can be defined as one
that should be marked as changed (updated) if component
objects are added, deleted, or changed even if the data
within the composite object itself remains unmodified.

In the example of Fig. 3, we will consider a Dept to be a
composite object because it logically contains Emp component
objects. An EmpList object is the relationship or container
connecting the composite and its components. We consider
Dept to be a composite object in this example because we
implicitly include all the employees in the department as part
of the department and want to consider the department to
be modified if there are any changes to any of the employees.
Alternatively, we could have considered Dept to exist
independent of its employees. Clearly we can sink into the dark
waters of a long philosophical discussion here (If you change
the engine in the car is it the same car?), so the design is
best approached physically. The basic question is whether
examination of the history of a composite object should
reflect changes to its component objects. For many complex
objects in our products the answer is yes.

Fig. 3. Example class schema.

Two references are necessary for an audited relationship.
References traversed from Dept to Emp are called component
references and the reverse references are called back
references.

Audited and Nonaudited Objects. As exemplified by the use of
classes derived from the list class in audited and nonaudited
relationships, auditing can be specified on a subclass or an
individual object. Moreover, it is permissible to turn auditing
on only after some event in the life of an object. For the
moment, we consider only the case where an object in an
audited class is audited from inception.

We see in Fig. 3 that objects of the composite Dept class
should be audited from creation but that DeptOffice and Division
are never audited. Semantically, this design means that
the history of a Dept object, including the composition of all
of its component Emps, can be retrieved at any stage of its
history. In contrast, the DeptOffice for the Dept and the list of
Depts in the Division can be retrieved only for their current
values.

Audit Mechanism

Auditing Objects. Auditing an object means that all images
of the object must be maintained in the database, starting
with the image that existed when auditing was turned on.
In contrast, only the latest (current) image of a nonaudited
object is retained. Note that when an audited object is to be 
written to the database, the decision to replace the old image 
depends on whether the old one was audited. Successive 
object images generated through update will be referred to 
as revisions of the object, whether the object is audited or 
not. The revision number is used by the ODBMS to ensure 
that a client is working with the correct image of an object. 
There can be only one current revision of an object and only 
the current revision can be updated. 
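
The retention rule can be illustrated with a small sketch. It is written in Python for brevity and is not the ODBMS code. The `ObjectStore` class and its methods are hypothetical, and each image is simplified to a tuple of revision number, audit flag, and data.

```python
class ObjectStore:
    """Hypothetical store that keeps every image of audited objects."""

    def __init__(self):
        self.images = {}  # object id -> list of (revision, audited, data)

    def write(self, oid, data, audited):
        """Store a new image of the object and return its revision number."""
        history = self.images.setdefault(oid, [])
        revision = history[-1][0] + 1 if history else 1
        if history and not history[-1][1]:
            # The old image was not audited, so it is replaced, not kept.
            history.pop()
        history.append((revision, audited, data))
        return revision

    def current(self, oid):
        """Return the single current image, the only one that may be updated."""
        return self.images[oid][-1]

    def revisions(self, oid):
        """Return the revision numbers still present in the store."""
        return [rev for rev, _, _ in self.images[oid]]
```

Note that the revision number keeps counting whether or not the object is audited. Only the number of retained images differs.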

The term version is used for the concept of distinguishing
variations of an object that can all be current. For example,
different versions of a glossary can exist for different languages
but each version may undergo revision to add terms
or correct errors. An object is also marked with a commit
timestamp, which is exactly the same for all objects in a
(possibly distributed) transaction. These attributes of an
object, along with its identifier and other data, are contained 
in a header that is prepended to the object in the database 
and maintained separately by the C++ object in the client 
object manager. 

Auditing Relationships. Auditing relationships requires some
mechanism for recording the history of the relationship.
Rather than implement a database relationship mechanism
and audit it separately from auditing objects, it makes sense
to implement relationships as objects themselves. Auditing a
relationship is then no different than auditing an object.

Deleting Audited Objects. Deleting an object becomes complicated
when the object is audited because the object still
exists in the database until the delete is committed. The
delete action must be represented in the database somehow,
so that the timestamp and revision number marking the end
of its life are available. We use a pseudo-object for this purpose.
Archiving audited objects, or portions of their history,
may involve actually referencing and loading these pseudo-objects
representing the delete operation.

Update Propagation. An important objective of the audit
mechanism should be to update the minimum amount of
information to document a change fully. For this reason
we reject the simple "archive copy" approach to auditing
whereby the entire composite object is copied each time
a component changes. Thus, we should not simply make
a copy of the entire Dept composite hierarchy just because
an Emp changed because this produces a huge amount of
redundant data.

Auditing a composite hierarchy is implemented in our system
by propagating the update of a component through the
relationship and composite parent objects using back references.
For example, updating member data in an Emp object
will trigger an update in the EmpList and Dept but will not
necessitate an update or copy of other Emps or of other
components of Dept. It is necessary to mark composite
objects as updated even though their member data has not
changed because the composite they represent has changed.
Note that there is nothing to be gained by updating a nonaudited
object that references an audited one because it
does not have a history corresponding to the past history of
the referenced object. Therefore, for example, Division is not
updated when Dept changes.

Fig. 4. Example object history.

It is impractical to expect programmers to follow these back 
references each time they update an object. It is also asking 
for bugs to expect them to qualify the propagation correctly 
according to audit state and update type. We have solved
this problem by incorporating back references implicitly
within relationship objects and component objects. The
object manager code propagates updates automatically as
appropriate. 
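
A minimal sketch of this propagation follows, in Python rather than the C++ of the actual object manager. The `Obj` class, its `parent` back reference, and the `propagate_update` function are hypothetical names. The sketch also simplifies matters by bumping the revision on every call, whereas a real transaction would change an object's revision only once.

```python
class Obj:
    """Hypothetical database object carrying an implicit back reference."""

    def __init__(self, name, audited, parent=None):
        self.name = name
        self.audited = audited
        self.parent = parent   # back reference to the relationship or composite
        self.revision = 1


def propagate_update(obj, updated=None):
    """Mark obj as updated and follow back references through audited parents.

    Returns the names of the objects marked as updated, in order.
    """
    updated = [] if updated is None else updated
    obj.revision += 1
    updated.append(obj.name)
    parent = obj.parent
    # Nothing is gained by updating a nonaudited parent, so stop there.
    if parent is not None and parent.audited:
        propagate_update(parent, updated)
    return updated
```

With the example schema, updating an Emp marks its EmpList and Dept as updated but leaves sibling Emps, the nonaudited DeptList, and Division untouched.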

The audit contents of a database can be illustrated using
Fig. 4, an example history of a part of the example schema
in Fig. 3. The number shown for each object at a particular
time is its revision number, a simple count of the number of
database transactions that have changed the object. We see
that Division has not been changed since it was created.
DeptList was created at the same time as revision 1 but has been
modified twice since then (when Dept1 and then Dept2 were
added). Since DeptList is not audited, only the last revision
(revision 3) exists in the database.

The behavior of audited objects is different. Dept1 and its
EmpList1 were added to the DeptList as revisions 1. When Emp1
was added to EmpList1, the update was propagated to Dept1 as
well as EmpList1 so that the revision of the composite object
Dept1 reflects a change to one of its components. The same
thing happens when Emp2 is added. Note that Emp1 is not updated
in this operation, nor does the update propagate to the
nonaudited DeptList. A subsequent update of Emp2 (revision 2)
similarly causes propagated updates to EmpList1 and Dept1. To
make the example interesting, Emp2 has been deleted, represented
by the creation of the pseudo-object with revision
number 3D. This object really exists in the database as a
marker of the end of the life of Emp2 (figuratively, we hope).
Just as for an update, this delete operation causes an update
of EmpList1 and Dept1.

Lock Propagation. For pessimistic concurrency models it is 
necessary to acquire an explicit lock on all objects to be 
updated at commit. Consequently, the object manager 
should propagate exclusive locks in the same way that it 
propagates updates and be able to deal with restoring locks 
to their original type if the propagation should fail partway 
through the propagation. 
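
The restore-on-failure behavior can be sketched as follows. This is an illustrative Python fragment, not the product's lock manager. The `Lockable` class, the lock type strings, and the `held_by_other` flag are assumptions made for the example, and every object in the chain is assumed to be audited.

```python
class LockError(Exception):
    """Raised when an exclusive lock cannot be granted."""


class Lockable:
    """Hypothetical object with a lock type and a back reference."""

    def __init__(self, name, parent=None, lock="none", held_by_other=False):
        self.name = name
        self.parent = parent                # back reference
        self.lock = lock                    # "none", "shared", or "exclusive"
        self.held_by_other = held_by_other  # True if another user holds a lock


def propagate_exclusive_lock(obj):
    """Lock obj and its chain of parents exclusively, or change nothing."""
    original = []                           # (object, previous lock type)
    try:
        node = obj
        while node is not None:
            if node.held_by_other:
                raise LockError(f"{node.name} is locked by another user")
            original.append((node, node.lock))
            node.lock = "exclusive"
            node = node.parent
    except LockError:
        # Restore every lock changed so far to its original type.
        for node, previous in reversed(original):
            node.lock = previous
        return False
    return True
```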

Audit Log. Another objective is to summarize changes to the
composite Dept object in one place. In this example, suppose
there are several changes to each of three Emps and to some
other components (not shown) in a single transaction. The
update mechanism records the fact in the Dept object that
something changed in at least one component object in this
transaction, but we need the AuditLog text object to itemize the
specific changes bundled in that transaction. Fig. 5 shows a
list of AuditLog objects hanging from Dept. Each AuditLog object
summarizes the changes to the composite Dept object during
a transaction. From the user's point of view, a convenient
implementation is to generate one-line entries in the log
automatically for each change the application makes to a
component object or the composite object, and then require
the user to add only a summary comment before commit.
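
One way to picture such a log is sketched below in Python. The `AuditLog` class shown here, its `record` method, and the `commit_with_log` function are hypothetical and stand in for whatever the real AuditLog object provides. The entry format is likewise an assumption.

```python
class AuditLog:
    """Hypothetical per-transaction log attached to a composite object."""

    def __init__(self):
        self.entries = []      # one-line entries generated automatically
        self.summary = None    # comment supplied by the user

    def record(self, obj_name, attribute, old, new):
        """Add one line describing a single change to a component."""
        self.entries.append(f"{obj_name}.{attribute}: {old!r} -> {new!r}")


def commit_with_log(log, history):
    """Append log to the composite's history; require a user summary first.

    Returns the number of logs now hanging from the composite.
    """
    if log.entries and not log.summary:
        raise ValueError("a summary comment is required before commit")
    history.append(log)
    return len(history)
```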

Object Access 

Revision and Time Retrieval. An audited object can be retrieved
from the database by specifying either a specific revision of
the object or by specifying an absolute time and finding the
object that was current at that time. A special time token
represents current time (also known as NOW in the literature),
corresponding to the most recent object revision.
Accessing objects by absolute time requires that the commit
timestamp of an object be determined so that it corresponds
correctly to the actions of multiple clients in a distributed
database environment. A consistent source of time must be
available to all clients and time must be specified precisely
enough to distinguish two transactions on a fast network.

An example is the best way to explain why both access
methods are needed. A common way to query the database
history in Fig. 4 would be to locate the current Dept1 and then
ask to see each of its previous revisions. Retrieving revision 6
of Dept1, the system would use its commit timestamp to retrieve
revision 1 of Emp1 and not find Emp2 because it was
deleted at this time in EmpList1. Moving back in time to revision
4 of Dept1, its EmpList1 would recover revision 1 of Emp1
again and also find revision 2 of Emp2. Instead of starting
with the current revision of Dept1, the initial query could have
specified any absolute time, say one somewhere between
revisions 2 and 3 of Dept1, to find revision 2 of Dept1. Then
the commit timestamp of revision 2 would be used to find
component data.
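
Both access methods can be sketched over a simple revision history. The Python fragment below is only an illustration. The `History` class is hypothetical, timestamps are plain numbers, a `None` image stands in for the delete pseudo-object, and a `None` timestamp plays the role of the NOW token.

```python
import bisect


class History:
    """Hypothetical revision history of one audited object."""

    def __init__(self):
        self.timestamps = []   # commit timestamps, ascending
        self.images = []       # data for each revision; None marks a delete

    def commit(self, timestamp, data):
        """Store a new revision and return its number, counted from 1."""
        self.timestamps.append(timestamp)
        self.images.append(data)
        return len(self.images)

    def by_revision(self, revision):
        """Retrieve the image of a specific revision."""
        return self.images[revision - 1]

    def by_time(self, timestamp=None):
        """Return (revision, data) current at timestamp; None means NOW."""
        if timestamp is None:
            index = len(self.images)
        else:
            # Last revision committed at or before the requested time.
            index = bisect.bisect_right(self.timestamps, timestamp)
        if index == 0:
            raise LookupError("object did not exist at that time")
        data = self.images[index - 1]
        if data is None:
            raise LookupError("object was deleted at that time")
        return index, data
```

A composite retrieved at some revision would pass its own commit timestamp to `by_time` on each component, which is how a component deleted by that time is correctly not found.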

Multiple Revision Management. A consequence of auditing
objects is that multiple revisions of the same object can exist
in the client cache at the same time. This presents a number
of practical problems for application programmers who
need a simple mechanism for specifying the correct object
revision to access. We have found that extending the meaning
of locking an object to include cache management of old
and current revisions of an object as well as the traditional
meaning of granting an explicit lock on the object is a practical
solution to this problem.

Accessing Objects through References. Mixing audited and
nonaudited objects in the same application exposes the
implementer to numerous opportunities to generate run-time
database load errors. Despite the problems of a schema with
both audited and nonaudited objects, it is often necessary
to mix the two to avoid creating impractical quantities of
data in the database. A few referencing rules, if they can be
enforced, solve the problems.

• Rule 1: Current access to nonaudited objects. A nonaudited
object must always be accessed as a current-time object,
meaning the latest one available from the database. For
example, all revisions of Dept use current time when accessing
DeptOffice because old revisions of DeptOffice do not exist.
If an old time were specified in the access request and
DeptOffice had not been changed, the access would succeed, but
a few minutes later, after DeptOffice had been updated by
another client and its timestamp had changed, the same
request would fail!

This rule is simple enough but does introduce some opportunities
for apparently inconsistent behavior. For example, if a
report generated for a Dept uses the reference to DeptOffice to
include its room number, the same report repeated later on
the same revision of the Dept could have another room number
if DeptOffice had been changed. Worse, the DeptOffice could
have been deleted from the Division, causing a load error.
These apparent problems are not the fault of the database
system but rather intrinsic in the heterogeneous schema.
They are solved either by auditing DeptOffice or by indicating
that DeptOffice is deleted by status data within the object
rather than deleting the object.

• Rule 2: Qualified access from nonaudited to audited objects.
As explained above, an access time or specific revision number
must be specified when accessing an audited object. For
example, the Division can reference a Dept in three ways: by
specific revision, by current time (meaning the latest revision),
or by absolute time. In practice, a user does not generally
know a specific revision of the Dept object or a specific
commit timestamp. Therefore the most useful access times
are current time or an absolute time the user specifies for
some reason.

A continuing complication when accessing audited objects is that the object
exists at some times but not others. For example, if we delete the Dept
when it is transferred out of the Division, we can't simply delete it from
the DeptList because we may need to access the old Dept information in the
future. Thus, the reference to a Dept should be tested for accessibility
before we try to load it for a specific time to avoid a load error. These
problems are solved if we simply audit the DeptList and Division.

• Rule 3: Self-timestamp access between audited objects. The easy and
foolproof way for an audited object to access another audited object is for
it to use its own commit timestamp. Furthermore, it is permissible for an
audited object to drop a reference when the object is deleted (or for any
other reason) because its previous revisions will still have the reference.
However, there are some complications.

It may be necessary for an object to access the same object in different
ways. Suppose the DeptOffice in Fig. 3 were audited. If we create a report
on a revision of Dept and include DeptOffice information, the method in
Dept creating the report should use its timestamp access to DeptOffice to
get contemporaneous information. However, if a Dept method is programmed to
update the DeptOffice, say with its identification information, it is
important that the current DeptOffice be accessed, because only a current
object can be updated. As long as the Dept is updated first, timestamp
access can be used for both, but it will not work if the update in Dept is
marked after accessing DeptOffice. In general, it is safer to code current
access explicitly when updating a referenced object.
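
The three access forms named in Rule 2 (specific revision, current time, and
absolute time) can be illustrated with a minimal sketch. It is not the
article's implementation: the class AuditedObject, its method names, and the
use of integers as commit timestamps are assumptions made for illustration,
and a deletion is modeled as a pseudo-revision whose state is None.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


class LoadError(Exception):
    """Raised when no revision satisfies the access request."""


@dataclass
class AuditedObject:
    """Hypothetical audited object: every commit is kept as a revision."""
    timestamps: list = field(default_factory=list)  # commit times, ascending
    states: list = field(default_factory=list)      # None marks a delete

    def commit(self, timestamp, state):
        if self.timestamps and timestamp <= self.timestamps[-1]:
            raise ValueError("commit timestamps must increase")
        self.timestamps.append(timestamp)
        self.states.append(state)

    def delete(self, timestamp):
        # Deleting inserts a pseudo-revision instead of removing data.
        self.commit(timestamp, None)

    def load_revision(self, number):
        """Access by specific revision number (1-based)."""
        if not 1 <= number <= len(self.states) or self.states[number - 1] is None:
            raise LoadError(f"no revision {number}")
        return self.states[number - 1]

    def load_at(self, time):
        """Access by absolute time: the revision that was current at `time`."""
        index = bisect_right(self.timestamps, time) - 1
        if index < 0 or self.states[index] is None:
            raise LoadError(f"object does not exist at time {time}")
        return self.states[index]

    def load_current(self):
        """Access by current time: the latest revision."""
        if not self.states or self.states[-1] is None:
            raise LoadError("object does not exist now")
        return self.states[-1]


office = AuditedObject()
office.commit(10, {"room": 101})
office.commit(30, {"room": 202})
print(office.load_at(20))      # what a Dept revision committed at time 20 sees
print(office.load_current())   # what an update must operate on
```

Under these assumptions, a Dept revision committed at time 20 that uses its
own timestamp (Rule 3) sees room 101, while a method that updates DeptOffice
would call load_current explicitly, as the text recommends.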

Midlife Changes of an Object. It is permissible to change an object from
nonaudited to audited at some time in its life. Probably the most common
reason to do this is to avoid generating large amounts of data while an
object is in some draft stage and being updated frequently. Keep in mind
that the object can be a composite object hierarchy encompassing hundreds
of large objects. Only after some approval stage does the application
really want to track the life of this composite construct.

Making an object audited may change the rules it uses to access component
objects and propagate updates. By implementing these mechanisms in object
manager utilities, the change can be made transparent to most application
developers.

Schema Constraints 

The previous discussion leads to a simple rule for auditing classes in a
schema: audit the components and relationships if the composite is audited.
For a composite object to truly represent the state of a component
hierarchy, all the components and component-composite relationships beneath
the composite must be audited when the composite is audited. Only then will
locks and updates be propagated correctly and can the composite use its
timestamp to access its components reliably.

For example, the schema figure shows the AuditLog as audited even though we
expect to create only a single AuditLog revision for each transaction.
Marking it audited follows the rule to acquire the programming
simplifications enumerated above. There is really no penalty in this case,
because storing one revision of an audited object takes no more room than
storing one revision of a nonaudited one.

There are reasons for breaking this rule. In large realistic systems (in
contrast to small demonstration ones) we face realistic constraints on
space and often somewhat ambiguous application requirements. As an example,
consider DeptOffice, which is marked as nonaudited in Fig. 3. If we assume
that there are good application reasons for not auditing DeptOffice, we
have to carefully access the references between Dept and DeptOffice
according to the complications discussed above and accept the apparent
inconsistencies that these relationships may produce.

Database Storage

Object storage implementations are beyond the scope of this article, but it
is worthwhile to mention a couple of considerations. First, it is not
necessary to have a specialized database to store audited objects. We have
implemented an auditing database that can use either Oracle tables or our
own file storage manager. The main complications are:

• Providing an efficient access method that will find an object current at
  a time that does not necessarily correspond to a timestamp

• Handling pseudo-objects representing delete.

Second, it is advisable to provide efficient access to current objects.
Because audited objects are never deleted it is not unreasonable to expect
hundreds of copies of an object in an old database. Most applications will
primarily access the current revision of an object and have to stumble over
all the old revisions unless the storage manager distinguishes current and
old audited data. It may be worth introducing some overhead to move the old
revision of an object when a new revision appears to maintain reasonable
access efficiency.



Some object database systems map object data to relational tables. The
relational system can represent the primary object depository or,
alternatively, only selected data can be mapped to enable customers to use
the ad hoc query and report-writing capabilities of the relational database
system. Extending these systems to handle audited data simply requires
adding a revision number, timestamp, and object status code to the mapped
data. The ad hoc user should be able to formulate the same type of revision
and time-dependent queries of the relational database as a programming
language does of the object database. The status is necessary to
distinguish old audit data, current objects, and deleted pseudo-objects.
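
As a rough illustration of such a mapping, the sketch below stores a
revision number, a commit timestamp, and a status code alongside mapped
object data and answers an "as of time T" query. The table name dept_audit,
its columns, the status values, and the sample rows are all hypothetical;
SQLite stands in for whatever relational system is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# status is 'old', 'current', or 'deleted' (a delete pseudo-object)
conn.execute(
    """CREATE TABLE dept_audit (
           obj_id INTEGER, revision INTEGER, committed INTEGER,
           status TEXT, name TEXT,
           PRIMARY KEY (obj_id, revision))"""
)
conn.executemany(
    "INSERT INTO dept_audit VALUES (?, ?, ?, ?, ?)",
    [
        (101, 1, 100, "old", "Research"),
        (101, 2, 200, "old", "R&D"),
        (101, 3, 300, "current", "R&D Lab"),
        (102, 1, 150, "old", "Sales"),
        (102, 2, 250, "deleted", None),
    ],
)


def rows_as_of(conn, when):
    """Return (obj_id, revision, name) for each object as it stood at `when`."""
    return conn.execute(
        """SELECT a.obj_id, a.revision, a.name
             FROM dept_audit AS a
            WHERE a.committed = (SELECT MAX(b.committed)
                                   FROM dept_audit AS b
                                  WHERE b.obj_id = a.obj_id
                                    AND b.committed <= ?)
              AND a.status <> 'deleted'
            ORDER BY a.obj_id""",
        (when,),
    ).fetchall()


print(rows_as_of(conn, 220))   # both objects exist at time 220
print(rows_as_of(conn, 260))   # object 102 has been deleted by time 260
```

The correlated subquery picks, for each object, the latest revision
committed at or before the requested time; the status test then hides
objects whose revision at that time is a delete pseudo-object. A query for
current data only would simply filter on status = 'current'.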

Archiving

A lot of database data is created very rapidly in auditing databases. At
some point some of it must be moved to secondary storage as archived data.
As usual, auditing database systems pose special challenges for thinning
data without corrupting the remaining objects.

What Is an Archive? Several types of archives are possible. One common
repository is a file containing object data in a special format and
probably compressed. Data is moved to the archive using special archive
utilities and must be dearchived back into the active database for access
using the same special utilities. This method maximizes storage compactness
but pays for it by a cumbersome process to retrieve the archived data when
needed. Another possibility is to move data to a separate data partition
(table space) that can be taken offline. Access to the archived data might
require dearchiving or, if the complexity is tractable, unioning the
archived data with the active data in queries.

At the other extreme is the use of a distributed database system to connect
the active and (possibly multiple) archive databases. The archive medium,
then, is just another database that should have read-only access (except
during an archive operation by system utilities). A distributed database
system connects the active and archive databases during the archive and
dearchive processes, allowing the data to be moved between databases as a
distributed transaction. This is the method we have chosen to use in our
products. A distributed archive system allows continued growth of archived
data while retaining reasonable access times when necessary. Another
advantage is the reliability of the archive and dearchive processes because
they are a distributed transaction subject to two-phase commit protocols
and recovery mechanisms. Finally, it is possible to access archived data
automatically without dearchiving if the archive database is on line. This
indirect access feature is explained more fully below.

Archiving Entire Objects. The first mechanism for thinning data is to
remove objects that will no longer be modified. Generally status within the
object indicates when this state of life has been achieved or, perhaps,
just the time since the object was last modified is sufficient. Can we just
remove all revisions of the object from the active database and put them in
an archive record?

The first problem is simply finding the old object because it might have
been deleted. It might not even be in the list of current objects in a
nonaudited list. For example, in Fig. 3 we had better not delete a Dept or
delete it from the DeptList until the time comes to archive, or we will
never be able to find the orphaned object. When archiving a Dept it would
be an oversight to archive just the current Emps. What about the one that
was deleted earlier in the life of the Dept and is referenced only in an
old revision? Fig. 4 shows this to be the case for Emp2 in Dept1.
Evidently, it will be necessary to search all the old revisions of all
composite objects just to identify all candidates for archiving. A special
key field to identify all components of a composite to be archived is a big
help here.

The second, admittedly mechanistic, worry is how to remove an audited
object, since deleting actually results in inserting a new pseudo-object,
and we can't even access a deleted object at current time! Presumably some
additional code design and implementation provides a mechanism for actually
removing an audited object and all of its old revisions, as well as
accessing deleted objects. This operation is called transfer out to
distinguish it from deletion. Similarly, the database must allow transfer
in of multiple object revisions, including pseudo-objects representing
delete.

Now we can move on to the problem of other objects that access the archived
object. Because archiving is not deleting, objects that reference an
archived object need to retain these references in case the archived object
must be accessed in the future. For example, we should retain an entry in
the nonaudited DeptList for an archived Dept object even if it is not
immediately accessible. One solution is to place a status object on each
relationship in the DeptList. This status object can contain archive
information. Another solution is to replace the archived Dept object (and
its components) with a placeholder object that marks it as archived and
could also contain archive information. Unless we want to start changing
references in old objects, this new placeholder object will have the same
OID (object identifier) as the old one. A variation of the second method is
to record archive information within the ODBMS and trap references to
archived objects.

These solutions work if the referencing object is not audited. But what if
it is audited? Updating the current object or marking the status of its
reference to the archived object may be satisfactory for current time
access but will result in a load error if older revisions attempt to access
the object using references that were valid back when the old revision was
current. Unless we want to start updating old revisions (a scary idea if we
want to trust the integrity of audited data), the archiving mechanism must
handle these old references between audited objects without modification or
qualification of the old references. The general solution to the
archive-reference problem probably must be implemented at the database
level. The database lock or load mechanism must be able to distinguish a
reference to an object that never existed for the revision-time criteria
specified from one that existed but is now archived. The user must be
notified that the data is archived without disrupting normal processes.

Incremental (Time-Slice) Archiving. In some applications it may not be
practical to archive entire objects. The lifetime of some archivable
objects (actually composite objects with thousands of component objects) in
some systems can be as long as five years, making archiving the object
theoretically possible at some time but not very useful for reducing online
data on a monthly or yearly basis. Clearly a mechanism for archiving just
the aged revisions of objects is necessary in these applications.

The best way to specify incremental archiving is on a time basis, because
time can be applied uniformly to all objects. In this scenario we could
specify a list of candidate archive objects and a threshold archive time,
such that all revisions of these objects found with a commit timestamp
equal to or earlier than the archive threshold would be moved to the
archive. Well, actually, not quite all of them! Since we must satisfy
requests by the active database for the object at the threshold time, we
must keep the one object revision with a commit timestamp before the
threshold time because this revision is current at the threshold time
(unless the object was deleted, of course).
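
The selection rule just described can be sketched in a few lines. The tuple
layout (revision number, commit time, deleted flag) and the function name
select_for_archive are assumptions for illustration; the handling of the
deleted case is one reading of the parenthetical remark above, namely that
nothing needs to stay behind when the revision in force at the threshold is
a delete pseudo-object.

```python
def select_for_archive(revisions, threshold):
    """Split revisions of one object into (to_archive, to_keep).

    `revisions` is a list of (revision_number, commit_time, deleted) tuples
    sorted by commit time. Revisions committed at or before `threshold` are
    archived, except the last of them: it is still the current revision at
    the threshold time, so it stays in the active database unless it is a
    delete pseudo-object.
    """
    aged = [r for r in revisions if r[1] <= threshold]
    newer = [r for r in revisions if r[1] > threshold]
    if aged and not aged[-1][2]:
        return aged[:-1], [aged[-1]] + newer
    return aged, newer


revs = [(1, 10, False), (2, 20, False), (3, 30, False), (4, 40, False)]
to_archive, to_keep = select_for_archive(revs, threshold=25)
print(to_archive)   # only revision 1 moves to the archive
print(to_keep)      # revision 2 stays because it is current at time 25
```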

To implement this incremental archive mechanism, as described so far, the
system must keep track of the threshold time and archive information about
the revisions of each object. Attempted access to revisions extant before
the archive time should receive an archive error and perhaps supply the
archive information so that the user knows where the data can be found.

In this scenario, archiving probably is not a one-time operation. What do
we do with the remaining revisions of the object when the archive operation
is repeated a month later, specifying a threshold archive time one month
later than that in the previous operation? From a bookkeeping point of
view, it would make sense to simply append the new archive revisions of an
object to the old ones in the archive and update archive information in the
active database. In practice most customers will not find this method any
more acceptable than filing tax records by subject rather than date. Most
archive time slices will be kept as an archive record labeled by the date
range of the data it contains; it could be a tape collecting dust in a
rack. If we needed to append to an archive whenever more revisions of a
long-lived object were archived, the archive operation would eventually
require mounting many archives. Thus, a practical archive mechanism must
allow various revisions of an audited object to be scattered in multiple
archive databases.

If a single object can be contained in multiple archives, we must know
which archive might contain the requested data. Moreover, it would be nice
to guarantee that the load request could be satisfied if the archive were
made available. A customer will be upset if the archive supposedly
containing the missing data is found and mounted and then the customer is
told that the data is still missing! Thus, it will be most convenient to
retain in the active database complete information about the range of
revisions and commit timestamps of an object in each archive. This archive
record, called an archive unit, contains information about the continuous
sequence of object revisions of an object that were transferred in the
archive operation.

An example of time-slice archiving is presented in Fig. 5. An audited
object identified by ObjNum 101 has created 10 revisions in the active
database. At some time in the past, an archive database was created,
designated as 1995 here. The first time-slice operation moved revision 1 to
the archive database and left an archive record in the active database. The
archived object acquired a new identifier, shown as 23, because an ObjNum
is unique only within a single database. Subsequently, another archive
operation moved revisions 2 and 3 to the same database, leaving another
archive record. The following year, another archive database was created
and revisions 4, 5, and 6 were archived there.

Dearchiving and Archive Access

Dearchive Operation. The process of dearchiving is just the reverse of
archiving, whether the archive medium is a compressed file or a remote
database. If incremental archiving is used and an archive record is
maintained in the active database, it reduces bookkeeping to dearchive an
archive unit (group of continuous object revisions) and remove the archive
record from the active database. It is also necessary to dearchive archive
units continuously from the youngest one to the target one to ensure the
integrity of the time-retrieval mechanism. There must be a continuous
revision sequence from the current timestamp to the timestamp preceding or
equal to the target timestamp.

Indirect Access to Archive Data. Of greater interest is the possibility
that dearchiving may not be necessary. If archived data resides on archive
databases in a distributed database system, it is possible for a
sophisticated object manager to access archived data in remote archive
databases and integrate it with the active data. Important advantages of
this mechanism are:

• Reduced resources for the active database because dearchiving is not
  necessary

• Transparent access to archived data by ordinary users

• Reduced administration, because the archive and dearchive processes
  become simply distributed transactions without introducing special
  mechanisms into the life of a system administrator.

Fig. 5. Time-slice archive example.

This mechanism relies on maintenance of an archive record in the active
database that records information about each archive unit placed in an
archive database. The existence of an archive record in the active database
allows the active database to return a forwarding reference instead of a
load error when a requested revision or time of an object has been
archived. The reference contains the address of the archive database,
allowing the object manager to proceed to indirectly load the archived
object as an alias for the requested one. Obviously, alias objects must be
marked to prohibit update. The object manager can take the appropriate
action to access archived objects (or revisions of objects) depending on
the wishes of the user and system policy. In our system, the object manager
recognizes several access modes to indicate how to treat archived data for
each application operation.
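
The role of the archive record can be made concrete with a small sketch
based on the Fig. 5 example. Everything here is illustrative: the class
names ArchiveRecord and ForwardingReference, the load function, and the name
"1996" and identifier 57 for the second archive are assumptions, since the
text gives only the first archive's designation (1995) and identifier (23).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchiveRecord:
    """One archive unit: a continuous run of revisions held in one archive."""
    first_rev: int
    last_rev: int
    archive_db: str    # address of the archive database
    archive_obj: int   # the object's identifier inside that archive


@dataclass(frozen=True)
class ForwardingReference:
    archive_db: str
    archive_obj: int
    revision: int


class LoadError(Exception):
    """The requested revision never existed."""


def load(active_revisions, archive_records, revision):
    """Return the active state, or a forwarding reference if it was archived."""
    if revision in active_revisions:
        return active_revisions[revision]
    for record in archive_records:
        if record.first_rev <= revision <= record.last_rev:
            return ForwardingReference(record.archive_db, record.archive_obj, revision)
    raise LoadError(f"revision {revision} never existed")


# Object 101 after the three archive operations described for Fig. 5.
records = [
    ArchiveRecord(1, 1, "1995", 23),
    ArchiveRecord(2, 3, "1995", 23),
    ArchiveRecord(4, 6, "1996", 57),   # hypothetical name and identifier
]
active = {rev: f"state of revision {rev}" for rev in range(7, 11)}

print(load(active, records, 8))   # still in the active database
print(load(active, records, 3))   # forwarded to the 1995 archive
```

In this sketch the caller can tell three outcomes apart: data that is still
active, data that exists but has been archived (and where it is), and a
request for something that never existed. An object manager could follow
the forwarding reference automatically and mark the result read-only.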

Conclusion 

The trend towards requiring audit trails of more and more processes is
driving new database capabilities. Old models of audit logging and periodic
archives do not provide routine access to audit data and are not scalable
to large systems. We should not view auditing as a specialized,
application-specific capability to be overlaid on a general-purpose
database. Object database systems are well-suited to implement this new
technology because much of the technology can be incorporated efficiently
within the DBMS, freeing the designer and programmer from many of the new
complexities introduced in the discussion above. Ad hoc implementations
using stored procedures, triggers, or other enhancements of relational
databases will have difficulty matching the efficiency of systems in which
auditing is an implicit capability.

Auditing objects in complex schemas and archiving the data in a distributed
environment are complex processes that would appear to be difficult to
implement in ordinary applications. On the contrary, we have found that
these capabilities can be used reliably by application developers because
most of the complexity can be concentrated in the object manager of an
ODBMS and core class code. Similarly, access to archived data can be nearly
transparent to most application code with judicious use of access modes and
exception traps if the object manager implements automatic indirect access
to archive databases.

The ambitious goals of rapid access to active data, convenient access to
old data, practical database size, and reasonable application complexity
can be achieved in an internally audited system by careful design of a
distributed database system.



Testing Policing in ATM Networks



Policing is one of the key mechanisms used in ATM (Asynchronous
Transfer Mode) networks to avoid network congestion. The HP E4223A
policing and traffic characterization test application has been developed
to test policing implementations in ATM switches before the switches are
deployed for commercial service.

by Mohammad Makarechian and Nicholas J. Malcolm 



The Asynchronous Transfer Mode (ATM) is a network technology that can
satisfy the quality-of-service requirements of many different types of
traffic. The ability of ATM to handle many different types of traffic and
its ability to operate at high bandwidths position it to be one of the core
technologies behind future broadband wide area networks and the Internet.
To provide quality-of-service guarantees, ATM relies crucially upon
avoiding network congestion. Congestion can result in unacceptably large
cell loss or delays. Cells that are lost may have to be retransmitted,
which can result in increased congestion. Cells with excessive delays can
cause higher-layer protocol timers (e.g., TCP/IP timers) to expire, which
will result in even more cells being retransmitted. For video traffic,
excessive delays or cell delay variation can result in underflow in the
video decoder buffers. This can cause jagged movements or screen freezes
when playing back the video.

Policing is one of the key mechanisms used by ATM to avoid network
congestion. Policing is responsible for monitoring the amount of traffic
sent by a connection. If a connection is sending more than the agreed-upon
amount of traffic, then policing can discard traffic from the offending
connection. By preventing too much traffic from entering the network,
policing helps to avoid network congestion. This ensures that existing
connections in the network will continue to receive their required quality
of service.

Policing occurs at the user-network interface (UNI), where user traffic
first enters a public network, and at the broadband ISDN intercarrier
interface (B-ICI), where traffic crosses from one public network to
another. Policing is known as usage parameter control (UPC) at the UNI and
network parameter control (NPC) at the B-ICI.

Given the importance of policing to ATM, it is essential that policing be
well-tested. Policing must be tested both by network equipment
manufacturers when developing switches, and by network providers when
commissioning switches. The HP E4223A policing and traffic characterization
test application has been developed to test policing implementations in ATM
switches. This product allows users to generate policing test traffic and
to measure how the traffic is affected by policing. In this way the HP
E4223A can thoroughly test policing in ATM switches before the switches are
deployed for commercial service. The HP E4223A can also analyze the traffic
originating from a traffic source to determine whether the source is
sending too much traffic into the network.



How Policing Works 

For ATM to meet its quality-of-service commitments, it is essential to
reduce or eliminate network congestion. Congestion can result in
unacceptably poor network performance. ATM attempts to avoid congestion by
managing network resources (e.g., transmission links, buffer space inside
switches) in such a way that congestion will not occur. A connection will
only be established if there are enough network resources to provide an
acceptable quality of service to the new connection without disrupting the
service provided to existing connections. Once a connection has been
established, usage parameter control (UPC) is responsible for policing the
connection traffic when it enters the network and ensuring that the traffic
does not exceed the agreed-upon traffic rate.

Depending on the requirements of the traffic source, ATM provides a variety
of service categories, as shown in Fig. 1.* For example, an application
such as digital voice may be suited for constant bit rate (CBR) service,
while compressed video may be suited for real-time variable bit rate
(rt-VBR) service. The service category used by a connection is chosen at
connection setup time. For each service category several traffic parameters
are given to the network to describe the type of traffic that will be sent
by the connection. For switched virtual connections (SVCs), the traffic
parameters are given to the network during the call setup or renegotiation
phase of signaling. For permanent virtual connections (PVCs) the traffic
parameters can be specified manually at subscription time. The traffic
parameters are used by the network to police the traffic on the connection
and to determine how many network resources must be reserved to support the
connection.



List of Acronyms

ATM      Asynchronous Transfer Mode
B-ICI    Broadband ISDN intercarrier interface
BSTS     Broadband Series Test System
CBR      Constant bit rate
CDVT     Cell delay variation tolerance
CLP      Cell loss priority
GCRA     Generic cell rate algorithm
MBS      Maximum burst size
NPC      Network parameter control
PCR      Peak cell rate
PVC      Permanent virtual connection
rt-VBR   Real-time variable bit rate
SCR      Sustainable cell rate
SVC      Switched virtual connection
TAT      Theoretical arrival time
TCP/IP   Transmission Control Protocol/Internet Protocol
UNI      User-network interface
UPC      Usage parameter control
VBR      Variable bit rate
VPI/VCI  Virtual path identifier/virtual channel identifier



Constant Bit Rate (CBR)
  Characteristics:
    • Tightly bounded cell delay variation
    • Static amount of bandwidth available throughout a connection's lifetime
    • Low cell loss ratio
  Example applications: Video/audio on demand, video conferencing,
    digital telephony
  Traffic parameters:
    • PCR, CDVT on CLP = 0+1 cells

Real-Time Variable Bit Rate (rt-VBR)
  Characteristics:
    • Supports bursty traffic
    • Tightly bounded cell delay variation
    • Low cell loss ratio
  Example applications: Compressed video, distributed classroom

Non-Real-Time Variable Bit Rate (nrt-VBR)
  Characteristics:
    • Supports bursty traffic
    • No cell delay variation bounds
    • Low cell loss ratio
  Example applications: Airline reservations, banking transactions

  Traffic parameters (listed across both VBR categories):
    • PCR, CDVT and SCR, MBS on CLP = 0+1 cells
    • PCR, CDVT on CLP = 0+1 cells, SCR, MBS on CLP = 0 cells
    • PCR, CDVT on CLP = 0+1 cells, SCR, MBS on CLP = 0 cells,
      tagging applicable

Unspecified Bit Rate (UBR)
  Characteristics:
    • No cell delay variation bounds
    • No cell loss ratio bounds
    • "Best effort" service
  Example applications: File transfer, e-mail
  Traffic parameters:
    • PCR, CDVT on CLP = 0+1 cells
    • PCR, CDVT on CLP = 0+1 cells, tagging applicable

* The ATM Forum also defines a service category called available bit rate
(ABR). Policing of ABR connections is not discussed in this article.

Fig. 1. ATM service categories.

Cells in an ATM network can be given a high priority or a low priority.
High-priority cells have the cell loss priority (CLP) bit in their headers
set to 0, while low-priority cells have a CLP of 1. Low-priority cells are
more likely to be discarded if the network becomes congested. At the
minimum, the traffic parameters declared to the network at connection setup
time include the peak cell rate (PCR) and the cell delay variation
tolerance (CDVT) for CLP = 0+1 cells, that is, for all cells in the
connection, regardless of priority.

The PCR is the maximum rate at which the source may generate traffic. The
CDVT indicates how many back-to-back cells there may be at the user-network
interface. Together, the PCR and the CDVT give the network an idea of when
to expect the next arrival of a cell given that one has just arrived. In
addition, for variable bit rate service categories, a traffic source can
also specify a sustainable cell rate (SCR) and a maximum burst size (MBS).
The SCR gives an upper bound on the conforming cell rate of a VBR
connection. The MBS gives the maximum burst size for a VBR connection,
assuming that the cells in the burst arrive at the PCR. Specifying the SCR
and MBS allows the network to allocate resources such as buffer space more
efficiently because the network has more knowledge about the type of
traffic that will be generated.

The UPC function is responsible for ensuring that traffic on a connection
does not exceed the agreed-upon rate. The policing function in a switch
first validates the VPI/VCI (virtual path identifier/virtual channel
identifier) of arriving cells, and then determines whether or not the cells
are conforming to the agreed-upon PCR or SCR. Whether or not a cell is
conforming is determined by an algorithm called the generic cell rate
algorithm (GCRA),* popularly known as the "leaky bucket" algorithm
(Fig. 2). The GCRA has two parameters, denoted T and τ. The first
parameter, T, is the emission interval and can be regarded as the expected
interarrival time of conforming cells. The second parameter, τ, is the cell
delay variation tolerance (CDVT) and determines how many back-to-back cells
are allowed. The GCRA maintains a variable called the theoretical arrival
time (TAT), which gives the expected arrival time of the next cell. Cells
arriving more than τ units of time before the TAT are considered to be
nonconforming. Nonconforming cells can be tagged (given a lower priority)
or discarded by the switch. Cells arriving τ units of time before the TAT
or later are considered to be conforming. For each conforming cell, the TAT
is updated to give the expected arrival time of the next cell in the
connection.

* The GCRA is a reference algorithm used to define conformance. An actual
UPC implementation may use the GCRA or another algorithm provided that the
quality-of-service objectives for connections are met.



[Figure: flowchart. A cell arrives and is classified as either a
nonconforming cell or a conforming cell.]

Fig. 2. Generic cell rate (leaky bucket) algorithm.

For an example of how the GCRA can be used to police a CBR connection,
suppose that a connection has a peak cell rate of PCR = 8000 cells/s and a
CDVT of 0 µs. The GCRA emission interval is calculated as T = 1/PCR =
125 µs. The GCRA CDVT is τ = 0 µs. Table I shows the resultant conforming
and nonconforming cells assuming that a cell on the connection arrives
every 120 µs.



Table I

Conforming and Nonconforming Cells
for PCR = 8000 cells/s, τ = 0 µs

Cell #   Arrival (µs)   TAT (µs)   Conforming?
   1            0            0        yes
   2          120          125        no
   3          240          125        yes
   4          360          365        no
   5          480          365        yes
   6          600          605        no
   7          720          605        yes
   8          840          845        no
   9          960          845        yes
  10         1080         1085        no

To see how the CDVT can be increased to allow more back-to-back cells,
suppose that the CDVT used by the GCRA is increased to τ = CDVT = 11 µs.
The resultant pattern of conforming and nonconforming cells is shown in
Table II.



Table II

Conforming and Nonconforming Cells
for PCR = 8000 cells/s, τ = 11 µs

Cell #   Arrival (µs)   TAT (µs)   Conforming?
   1            0            0        yes
   2          120          125        yes
   3          240          250        yes
   4          360          375        no
   5          480          375        yes
   6          600          605        yes
   7          720          730        yes
   8          840          855        no
   9          960          855        yes
  10         1080         1085        yes
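
The two tables can be reproduced with a short sketch of the GCRA as the text
describes it: a cell conforms if it arrives no earlier than TAT − τ, and
only conforming cells advance the TAT. The function name gcra and the
convention of initializing the TAT to the first arrival time are choices
made here for illustration; as the footnote says, a real UPC implementation
may use a different algorithm.

```python
def gcra(arrivals, T, tau):
    """Classify cells by comparing each arrival with the TAT.

    Returns one (arrival, tat_seen, conforming) tuple per cell, where
    tat_seen is the theoretical arrival time in force when the cell arrived.
    All times share one unit (microseconds below).
    """
    results = []
    tat = arrivals[0] if arrivals else 0    # first cell sets the initial TAT
    for t in arrivals:
        conforming = t >= tat - tau
        results.append((t, tat, conforming))
        if conforming:
            tat = max(t, tat) + T           # nonconforming cells leave TAT alone
    return results


arrivals = [120 * n for n in range(10)]     # one cell every 120 us
T = 1_000_000 // 8000                       # 125 us for PCR = 8000 cells/s

for tau in (0, 11):
    pattern = ["yes" if ok else "no" for _, _, ok in gcra(arrivals, T, tau)]
    print(f"tau = {tau:2d} us:", " ".join(pattern))
```

With τ = 0 every second cell is rejected, because each cell arrives 5 µs
before its TAT. With τ = 11 µs the 5 µs shortfall can accumulate over three
cells (5, 10, then 15 µs) before it exceeds the tolerance, so only every
fourth cell is rejected.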



As mentioned earlier, some service categories also have sustainable cell
rate (SCR) and maximum burst size (MBS) parameters. In this case, one GCRA
is used to police the peak cell rate and another GCRA is used to police the
sustainable cell rate. The GCRA tolerance used with the SCR GCRA (τSCR) is
derived from the PCR, SCR, MBS, and CDVT parameters,¹ and permits a burst
of MBS cells at the peak cell rate. The PCR and SCR GCRAs form a dual leaky
bucket algorithm and operate in lockstep fashion. A cell is only considered
to be conforming if it conforms to both GCRAs. If tagging is allowed, then
high-priority CLP = 0 cells can be tagged (given a lower priority) if they
do not conform to the SCR GCRA.

In addition to its role in the UPC function, the GCRA can 
also be used inside a traffic source to ensure that the 
outgoing cell flow conforms to a particular cell rate. This is 
referred to as traffic shaping. When nonconformance is 
detected during shaping, the offending cells are delayed until 
their transmission will be conforming. In this way, traffic 
can be guaranteed conforming before it enters the network. 

The HP BSTS Policing Application 

The HP E4223A policing and traffic characterization test 
application is designed to test UPC implementations in 
network equipment and to analyze the characteristics of traffic 
on a connection. The HP E4223A is part of the HP Broadband 
Series Test System (BSTS).2 The HP Broadband Series Test 
System contains a number of VXIbus modules that allow 
testing of broadband networks over a variety of physical 
interfaces. For brevity, the HP E4223A will be denoted the 
HP BSTS policing application during the remainder of this 
article. 

The HP BSTS policing application works with the HP E4209 
cell protocol processor, a VXIbus module forming part of the 
HP BSTS. The cell protocol processor in conjunction with a 
line interface module can transmit and receive ATM cells for 
testing purposes. Received ATM cells can be stored in a 
capture RAM for later analysis. The HP BSTS policing application 
consists of embedded software running on the cell protocol 
processor module and a software component running on HP 
9000 Series 400 or 700 workstations. 

The HP BSTS policing application provides the following 
functions: 

• Generates traffic conforming to a single or dual leaky bucket 
algorithm (GCRA). 

• Generates UPC test cells, which are test cells designed for 
testing policing. 

• Makes a number of policing-related measurements on 
captured ATM cells, such as the number of nonconforming 
cells and the number of cells that were lost or tagged. 

• Makes general performance measurements on captured 
ATM cells, such as cell delay, interarrival time, and 
one-point cell delay variation. 

Traffic Generation 

When generating traffic to test policing, it is important to 
test the limits of the GCRA being used for policing. This 
means that it is important to generate traffic that has the 
maximum cell rate and burst size but is still conforming. 
Because policing is configured in a switch using the 
parameters of a GCRA, it is convenient to use the parameters of a 
GCRA when specifying traffic to test policing. Consequently, 
the HP BSTS policing application provides a GCRA 
distribution, which allows traffic to be generated using the 
parameters of a GCRA (Fig. 3). The GCRA combinations supported 
are: 

• PCR CLP = 0+1 (single leaky bucket) 

• SCR CLP = 0+1 and PCR CLP = 0+1 (dual leaky bucket). 

The GCRA distribution can be optimized to generate traffic 
based on either the cell rate or the burst size. This allows 
independent testing of how policing implementations handle 
cell rates and burst sizes. 

Fig. 3. Specifying the generic cell rate algorithm distribution 
with the HP E4223A policing and traffic characterization test 
application. 

Traffic that optimizes the burst size consists of repeated 
bursts of the maximum possible conforming burst size. The 
burst is at the line rate for the single leaky bucket. For the 
dual leaky bucket, the burst consists of MBS cells at the peak 
cell rate. The spacing between the bursts is the minimum 
necessary to maintain conforming traffic. For example, 
suppose a dual leaky bucket is chosen with SCR = 35%, MBS = 3 
cells, PCR = 100%, and CDVT = 0 ms. The generated traffic 
will consist of repeated bursts of three cells at 100% of the 
line rate, with a gap of six cells between any two bursts. 
Because cell transmission is quantized, the generated load is 
33.3%, which is less than the SCR. Traffic that optimizes the 
burst size has very precisely controlled burst sizes, but the 
rate of the generated traffic may be less than requested. 
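
The six-cell gap in this example can be checked numerically. The sketch below is an illustration under stated assumptions, not the method used inside the HP BSTS: it assumes the SCR tolerance is the burst tolerance (MBS - 1)(1/SCR - 1/PCR) plus the CDVT, as commonly derived from the ATM Forum traffic management specification, measures time in cell slots of the line rate, and assumes PCR equals the line rate so that the cells of a burst occupy consecutive slots. All function names are hypothetical.

```python
from fractions import Fraction


def scr_tolerance(pcr, scr, mbs, cdvt=0):
    """Assumed SCR tolerance in cell slots: burst tolerance plus CDVT."""
    return (mbs - 1) * (1 / scr - 1 / pcr) + cdvt


def bursts_conform(gap, scr, tau, mbs, n_bursts=50):
    """Check repeated back-to-back bursts of mbs cells against the SCR GCRA."""
    T = 1 / scr              # SCR emission interval in cell slots
    tat = Fraction(0)
    t = 0                    # current cell slot
    for _ in range(n_bursts):
        for _ in range(mbs):
            if tat > t + tau:        # cell arrives too early
                return False
            tat = max(Fraction(t), tat) + T
            t += 1                   # next cell of the burst, at line rate
        t += gap                     # idle slots before the next burst
    return True


def min_gap(pcr, scr, mbs, cdvt=0):
    """Smallest whole number of idle slots that keeps every cell conforming."""
    tau = scr_tolerance(pcr, scr, mbs, cdvt)
    gap = 0
    while not bursts_conform(gap, scr, tau, mbs):
        gap += 1
    return gap


pcr = Fraction(1)            # 100% of line rate
scr = Fraction(35, 100)      # 35% of line rate
gap = min_gap(pcr, scr, mbs=3)
load = Fraction(3, 3 + gap)
print(gap, float(load))
```

Under these assumptions the smallest conforming gap is six slots, so three cells are sent in every nine slots and the load is one third of the line rate.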

Traffic that optimizes the cell rate consists of traffic 
generated at the maximum conforming rate. The traffic will 
consist of an initial burst of the maximum conforming burst 
size, followed by cells at the PCR (for a single leaky bucket) 
or at the SCR (for a dual leaky bucket). For example, with 
a dual leaky bucket with SCR = 35%, MBS = 3 cells, PCR = 
100%, and CDVT = 0 ms, the generated traffic consists of an 
initial burst of three cells, followed by conforming traffic at 
the SCR. When optimizing the cell rate, the generated traffic 
rate can be much closer to the requested rate than traffic 
that optimizes the burst size. 

Policing Measurements 

When generating traffic to test policing, the number of tagged 
or discarded cells must be measured. It is difficult for test 
equipment to make these measurements with regular user 
traffic. This is because the test equipment usually does not 
know how many user cells were transmitted or the original 
priority of the user cells. For this reason, test cells are 
often used to measure policing performance. The HP BSTS 
policing application can transmit sequences of UPC test cells, 
each cell having the format shown in Fig. 4. The UPC test 
cells are specifically designed for testing policing with the 
HP BSTS policing application. 

Fig. 5. UPC test cell measurements. 

The payload of each UPC test 
cell within a sequence contains the following information: 

• SN0+1. The number of previous test cells in the sequence. 

• RES. Reserved bytes, set to zero. 

• SL0+1. The total number of test cells in the sequence. The 
HP BSTS policing application can repeatedly transmit 
sequences of 251 or 1021 test cells. 

• SL0. The total number of high-priority test cells in the 
sequence. 

• OCLP. The priority of the test cell when it is first 
transmitted. The low-order bit is set to 0 for high-priority cells 
and 1 for low-priority cells. The remaining bits in this field 
are set to zero. 

• SN0. The number of previous high-priority test cells in the 
sequence. 

• VN. The version number of the test cell format. The current 
version number is 0. 

• CRC-16. A cyclic redundancy check error code to provide 
protection and validation of the encoded payload 
information. The CRC-16 code is computed using the polynomial 
x^16 + x^12 + x^5 + 1. 
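
As an illustration of the CRC-16 field, here is a minimal bitwise sketch for the generator polynomial x^16 + x^12 + x^5 + 1 (0x1021). The text specifies only the polynomial. The initial register value, the most-significant-bit-first processing order, and the choice of payload bytes covered are assumptions made here for the example, and the function name crc16 is hypothetical.

```python
POLY = 0x1021  # x^16 + x^12 + x^5 + 1, with the x^16 term implicit


def crc16(data: bytes, init: int = 0x0000) -> int:
    """Bitwise CRC-16, MSB first, no reflection, no final XOR (assumed)."""
    crc = init
    for byte in data:
        crc ^= byte << 8                 # bring the next byte into the top bits
        for _ in range(8):
            if crc & 0x8000:             # top bit set: shift and subtract POLY
                crc = ((crc << 1) ^ POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


payload = b"123456789"                   # arbitrary example bytes
check = crc16(payload)
print(hex(check))

# With a zero initial value, appending the CRC (high byte first) to the
# protected bytes makes the receiver's CRC over the whole field come out zero.
assert crc16(payload + check.to_bytes(2, "big")) == 0
```

A receiver using the same conventions can therefore validate a captured test cell by running the CRC over the protected bytes plus the CRC field and comparing the result with zero.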

The information contained in the payload of UPC test cells 
allows a number of measurements to be made on captured 
ATM cells (Fig. 5). These measurements include the number 
of lost or tagged cells, which are measurements directly 
relevant to testing policing. 

In addition to making measurements with UPC test cells, the 
HP BSTS policing application can measure the conformance 
of traffic on a connection (Fig. 6). Conformance is measured 
by saving ATM cells in the capture RAM and then measuring 
the number of captured nonconforming cells. These 
measurements can be used to test the number of nonconforming 
cells detected by a switch or to test whether the traffic on a 
connection is conforming. 

Testing Policing in a Switch 

Policing in ATM switches must work correctly if ATM is to 
realize its potential for providing guaranteed quality of 
service for many different types of traffic. This requires that 
policing be thoroughly tested, both during switch 
development and during switch deployment. There are basically two 
aspects of policing to be tested: conforming cells should not 
be tagged or discarded, and nonconforming cells should be 
tagged or discarded to protect the quality of service provided 
to other connections. To test the above aspects of policing, 
the number of lost cells, the number of tagged cells, and the 
number of lost high-priority cells must all be measured 
(Table III). 



To simplify testing, the overall philosophy when testing 
policing in a switch is to test one GCRA parameter at a time. 
This means keeping the cell rate constant while varying the 
burst size, or keeping the burst size constant while varying 
the cell rate. Fig. 7 shows how to test the cell rate of a single 
leaky bucket. The switch is first configured with the leaky 
bucket parameters to be tested (in this case, the PCR and 
CDVT for a single leaky bucket). The PCR and CDVT used to 
generate test traffic are then entered, with the PCR used to 
generate the traffic being lower than the PCR in the switch. 
The testing then iterates between measuring the number of 
tagged or lost cells and increasing the PCR. If the number of 
tagged or lost cells differs from what is expected, a potential 
defect is logged. 

Example: Testing a single leaky bucket in a switch. The 
approach in Fig. 7 was followed to test a single leaky bucket 
(GCRA) in a switch with PCR = 8 Mbits/s and CDVT = 60 ms 
on a 155-Mbit/s SONET port. The HP BSTS policing 
application was used to generate traffic consisting of a repeating 
sequence of 1021 UPC test cells conforming to a GCRA with 
PCR = 4% (5.8 Mbits/s) and CDVT = 60 ms. The traffic was 
sent through the switch and back into the HP BSTS, where it 
was placed in the cell protocol processor capture RAM. The 
ratio of lost cells was calculated based on the captured UPC 
test cells. The PCR was then incremented by 0.5% for the 
next iteration of the test. The expected cell loss ratio was 0 
if the generated cell rate was less than the policing cell rate; 
otherwise the expected cell loss ratio was the proportion of 
generated cell rate greater than the policing cell rate, that 
is, (PCRtraffic - PCRpolicing)/PCRtraffic. Table IV shows the test 
results. 
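
The expected cell loss ratio used in this example can be written as a small function. This is only a sketch of the rule stated above. The function name is hypothetical, both rates must be given in the same unit, and the sample values are arbitrary rather than taken from Table IV.

```python
def expected_cell_loss_ratio(pcr_traffic, pcr_policing):
    """Fraction of generated cells expected to be discarded by policing.

    Zero while the generated rate does not exceed the policed rate;
    otherwise the share of the generated rate above the policed rate.
    """
    if pcr_traffic <= pcr_policing:
        return 0.0
    return (pcr_traffic - pcr_policing) / pcr_traffic


# Arbitrary illustration: traffic generated at 10 Mbits/s, policed at 8 Mbits/s.
print(expected_cell_loss_ratio(10.0, 8.0))
# Traffic below the policed rate should lose nothing.
print(expected_cell_loss_ratio(6.0, 8.0))
```

A measured loss ratio that differs noticeably from this value at a given step would be logged as a potential defect.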



Table III 

Parameters to Measure when Testing Policing 

Parameter                   Description 

Lost cells                  Number of cells discarded or lost in the 
                            switch. This parameter is used to check that 
                            policing is not discarding too many cells. 

Tagged cells                Number of cells tagged (changed from high to 
                            low priority) by the switch. This parameter is 
                            used to check that policing only changes the 
                            priority of a cell when necessary. 

Lost high-priority cells    Number of high-priority cells discarded or lost 
                            in the switch. This parameter is used to check 
                            that policing is not discarding too many 
                            high-priority cells. 

When testing policing in a switch with the HP BSTS policing 
application, the approach is to transmit a well-understood 
stream of test cells into the switch, capture the cells after 
they have traversed the switch, and then calculate how many 
cells were tagged or discarded. This approach is well-suited 
for stimulus-response type testing to test the capabilities of 
the switch systematically. 

[Flowchart steps: Configure Policing in Switch; Specify Initial PCR, CDVT in HP Policing Application; Generate UPC Test Cell Traffic; Measure Tagged, Lost Cells; Increase PCR in HP Policing Application.] 

Fig. 7. Testing the PCR (peak cell rate) of a single leaky bucket. 

Table IV 

Test Results for a Single GCRA in a Switch 

PCR for Generating Traffic   Cell Loss Ratio   Expected Result? 
          4.0%                    0.00              yes 
          4.5%                    0.00              yes 
          5.0%                    0.00              yes 
          5.5%                    0.02              yes 
          6.0%                    0.10              yes 
          6.5%                    0.17              yes 
          7.0%                    0.23              yes 



Testing Traffic Conformance 

Although policing in network switches will protect the 
network from traffic sources that send too much traffic, it is 
also important for traffic sources themselves to generate 
conforming traffic if possible. If a source generates 
nonconforming traffic, then the nonconforming cells will be 
discarded by the network and may have to be retransmitted by 
the source. This can significantly degrade the network 
performance experienced by the traffic source. The HP BSTS 
policing application can be used to check whether a source 
is generating conforming traffic. 

Example: Testing the conformance of MPEG-2 video traffic. This 
example demonstrates how to use the HP BSTS policing 
application to measure the conformance of a traffic source. 
As shown in Fig. 8, a laser disk player was connected to a 
commercial MPEG-2 encoder with a 45-Mbit/s DS3 ATM 
output. The encoder was set up to generate MPEG-2 video 
over ATM at 4 Mbits/s. The user's guide for the encoder 
states that a CDVT of 100 ms should be sufficient to 
compensate for the effects of adapting MPEG-2 packets to ATM 
cells. 

The ATM output of the MPEG-2 encoder was first sent 
directly to the HP BSTS, where the MPEG-2 traffic was placed 
in the cell protocol processor capture RAM. To verify that 
the MPEG-2 traffic was being captured correctly, the HP 
E4220B MPEG-2 protocol viewer test software was used to 
play back the captured video segment. The HP BSTS policing 
application was then used to measure the number of 
nonconforming cells with a PCR CLP = 0+1 GCRA. The GCRA 
parameters were chosen conservatively to be PCR = 4.07 
Mbits/s and CDVT = 200 ms. The HP BSTS policing 
application measurements showed that the MPEG-2 encoder was 
not well-behaved, with approximately 25% of the cells being 
nonconforming. 

To see the effect of the nonconforming cells on the video 
traffic, the output of the MPEG-2 encoder was then directed 
to an ATM switch before being routed to the HP BSTS. The 
switch was configured to police the MPEG-2 traffic with 
PCR = 4.07 Mbits/s and CDVT = 200 ms. Like the HP BSTS, 
the switch detected approximately 29% of the cells as being 
nonconforming. These nonconforming cells were discarded 
by the switch. The remaining cells passed through the switch 
and were captured in the cell protocol processor capture 
RAM. However, because of the large number of ATM cells 
that were discarded by the switch, it was not possible to 
play back even one video frame. This example clearly 
demonstrates the importance of generating conforming ATM 
traffic and shows how the HP BSTS policing application can 
be used to test the conformance of a traffic source. 

Conclusion 

Policing network traffic at the UNI or B-ICI is crucial to 
maintaining quality-of-service guarantees in ATM-based 
networks. The ability to support the quality-of-service 
requirements of many different types of traffic is one of the 
distinguishing features of ATM. This feature means that ATM is 
well-suited to providing the backbone network for future 
broadband wide area networks and the Internet. The HP 
BSTS policing application enables switch vendors and 
service providers to test policing and helps ensure the 
successful deployment of ATM. 

Acknowledgments 

The authors would like to acknowledge the contributions of 
many individuals who participated in the development and 
deployment of the HP BSTS policing application, including 
Brian Smith (project manager), Lawrence Croft (product 
design), Judith Watson (learning products), Mark Leonard 
(usability engineer), Drew Paterson, Jack Lam, and Scott 
Reynolds (QA testing), and Reto Brader (product marketing). 

References 

1. ATM Forum Traffic Management Specification Version 4.0, 
The ATM Forum Technical Committee, March 1996. 

2. The URL for the HP BSTS is http://www.hp.com/go/bsts 



Fig. 8. Testing the conformance of video traffic. 



© Copr. 1949-1998 Hewlett-Packard Co. 



August 1997 Hewlett-Packard Journal 98 



MOSFET Scaling into the Future 



2D process and device simulators have been used to predict the 
performance of scaled MOSFETs spanning the 0.35-µm to 0.07-µm 
generations. Requirements for junction depth and channel doping are 
discussed. Constant-field scaling is assumed. MOSFET drive current 
remains nearly constant from one generation to the next and most of the 
performance improvement comes from the decreasing supply voltage. 
Gate delay decreases by 30% per generation, nearly the same trend as 
previous generations. However, this performance gain comes at the price 
of much higher off-state leakage because of the reduction of the threshold 
voltage. Various solutions to this high leakage are discussed. 

by Paul Vande Voorde 



Hewlett-Packard adopted CMOS technology in the mid-1970s. 
At that time the gate length Lg was 4 µm and the gate 
oxide thickness Tox was 50 nm. Since then, each new 
generation of technology has shrunk Lg by about 30% and Tox by 
about 25%. The decrease in Lg has been tied to the evolution 
of lithography equipment. Following these scaling trends, 
intrinsic gate delay has decreased about 30% per generation. 
New generations of technology are released about every 
three years. The important principle in MOSFET scaling is 
that Lg and Tox must decrease together. Scaling one without 
the other does not yield adequate performance improvement. 

The performance metric for gate delay is CV/I, where C is 
the load capacitance, V is the supply voltage (Vdd), and I is 
the drive current of the MOSFETs (average of NMOS and 
PMOS). C is composed of both gate and junction 
capacitance. MOSFET scaling, which decreases Lg, Tox, and 
junction area while increasing substrate doping, tends to keep 
C fairly constant from generation to generation. For several 
generations of technology, the supply voltage was held 
constant at 5V (constant-voltage scaling). In that era, gate delay 
was reduced by ever-increasing MOSFET drive currents. 
Since the voltage was held constant while the dimensions 
decreased, the electric fields continuously increased. High 
fields and high currents tend to damage the gate oxide and 
lead to device deterioration. Thus, one of the main 
technology challenges has been to design MOSFETs with adequate 
reliability. 

Constant-voltage scaling ended as Lg approached 0.5 µm and 
Tox neared 10 nm. The demands of gate oxide reliability 
required that the supply voltage be reduced. This occurred as 
the peak oxide field reached roughly 4 MV/cm. We are now 
in an era where supply voltage is scaled along with Tox so 
that the peak oxide electric field remains roughly constant 
(constant-field scaling). This study examines some of the 
implications of this type of scaling in future technology 
generations. 

Process and Device Simulations 

The 2D process simulator TSUPREM-4 from Technology 
Modeling Associates Inc. of Sunnyvale, California was used 
to simulate scaled MOSFET device structures. The inputs to 
TSUPREM-4 are the implant and oxidation steps that would 
be used in the actual process. The process architecture 
assumed is similar to current CMOS processes, employing 
shallow source/drain extensions and deeper main source/ 
drain regions followed by silicidation. 

The 2D device simulator MEDICI, also from Technology 
Modeling Associates Inc., was used to predict the electrical 
characteristics of the device structures from TSUPREM-4. 
Here we use field-dependent mobility models that have been 
benchmarked to the HP CMOS10 process. Iterative 
simulations with TSUPREM-4 and MEDICI were performed to 
determine the requirements on junction depth and channel doping 
profile to ensure proper threshold and subthreshold behavior. 
Fig. 1 shows the device structures resulting from these 
simulations for each generation from 0.35 µm down to 0.07 µm. 
For Lg less than 0.15 µm, retrograde channel doping profiles 
are needed to control the subthreshold characteristics. 

Figs. 2 through 5 summarize the results of this scaling study. 
Fig. 2 shows the scaling of Tox with Lg. These two must 
scale together to get adequate performance improvement. 
Constant-field scaling dictates that Vdd must decrease 
proportionally to Tox, maintaining a peak oxide field of 4 MV/cm. 
For example, this results in Tox = 2.5 nm and Vdd = 1V for 
the Lg = 0.1 µm generation. 
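
The constant-field relationship is simple enough to check by hand or in code. The sketch below assumes that the peak oxide field is approximated by the supply voltage divided by the oxide thickness, and that the threshold voltage is set to 20% of the supply voltage as in this study. The function names are illustrative only.

```python
def supply_voltage(tox_nm, field_mv_per_cm=4.0):
    """Supply voltage (V) that holds the oxide field constant.

    1 MV/cm equals 0.1 V/nm, so Vdd = field * 0.1 * Tox.
    """
    return field_mv_per_cm * 0.1 * tox_nm


def threshold_voltage(vdd, fraction=0.20):
    """Threshold voltage (V) kept at a fixed fraction of the supply."""
    return fraction * vdd


vdd = supply_voltage(2.5)          # Tox = 2.5 nm, field = 4 MV/cm
vt = threshold_voltage(vdd)
print(vdd, vt)
```

For Tox = 2.5 nm this gives a 1-V supply and a 0.2-V threshold, the values quoted for the 0.1-µm generation.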

Fig. 3 shows the scaling of effective channel length (Leff) 
and the source/drain extension junction depth (Xj). For the 
0.1-µm generation, Leff is about 0.07 µm and Xj must be nearly 
50 nm. The series resistance of the source/drain extension 
must decrease even as the junction depth also decreases. 
This requires higher doping levels in the extension region and 
carefully minimized spacer widths. 

Fig. 4 shows the scaling of threshold voltage (Vt). Here Vt is 
kept at 20% of Vdd to maintain adequate current drive. This 
yields Vt = 0.2V for the 0.1-µm generation. Unfortunately, 
since off-state current varies exponentially with Vt, reducing 
Vt leads to much higher off-state leakage current (100 nA/µm 
for the 0.1-µm generation) than in current CMOS 
technologies. Here the simulations are tailored to predict the 
nominal leakage. Worst-case leakage would be approximately 
one order of magnitude higher for the 0.1-µm case. 

Fig. 5 shows the scaling of drive current and total gate 
capacitance. Because of the simultaneous scaling of Lg, Tox, 
and Vt, the current and capacitance do not change much 
from one generation to the next. Therefore, the gate delay 
metric CV/I decreases primarily because of the decreasing 
supply voltage. 

Device simulators allow one to examine the internal 
distributions within the device. Fig. 6 shows the lateral electric 
field along the channel for each of the device structures in 
Fig. 1. Even though Vdd decreases as shown in Fig. 2, the 
peak electric field near the drain continues to increase as 
Lg decreases. However, the width of the high-field region 
decreases, giving the electrons less and less distance to reach 
equilibrium with the electric field. When this "nonlocal" effect 
is included in MEDICI, the electron temperature can be 
calculated as shown in Fig. 7. Here, even though the peak field 
increases, the electron temperature decreases as Lg 
decreases. Thus, we expect that the reliability issues related 
to high-energy charge carriers will become less important in 
future generations of technology. 

Gate Delay Simulations 

MEDICI was used to generate a full set of IV curves for each 
of the devices in Fig. 1. IC-CAP, an HP software product for 
modeling semiconductor devices, was then used to extract a 
SPICE model for each device. Only the NMOS devices were 
actually simulated. The PMOS models were created from the 
NMOS models with appropriate modifications in mobility 
and series resistance to yield half the current drive of the 
corresponding NMOS. These device models were then used 
to simulate inverter chains as shown in Figs. 8 and 9. The 
load capacitance was varied to approximate fanouts of 3 
and 7. Interconnect loading was ignored. The results are 
shown in Figs. 10 and 11. Fig. 10 shows that the gate delay 
improves about 30% per generation with the scaling 
described in the previous section. This is nearly the same as 
the historical trend of previous generations. Note that for 
Lg = 0.1 µm the gate delay (fanout = 1) is less than 15 ps. 
This is faster than the best that can currently be obtained 
with bipolar ECL. Fig. 11 shows the dependence of gate 
delay on the power supply. The stars denote the operating 
point from constant-field scaling. Note that these highly 
scaled devices offer high-speed operation even at low supply 
voltages. For example, the 0.1-µm generation should yield 
23-ps gate delay (fanout = 1) even with Vdd = 0.5V. This 
would be excellent for low-power applications assuming 
that the high off-state leakage could be dealt with. 

Off-State Leakage 

Fig. 6. Lateral electric field along the channel beginning at the 
middle of the gate. 

Fig. 9. Inverter switching waveforms at nodes 1 and 2 of Fig. 8. 

Fig. 12. Drain current and off-state leakage current Ioff versus 
threshold voltage Vt for the 0.1-µm generation. 

The previous sections show that constant-field scaling of 
MOSFETs leads to a continuation of the historical trends of 
gate-level performance improvement. However, this comes 
at the price of exponentially increasing off-state leakage 
currents. For example, if an advanced circuit had 50 million 
micrometers of device width producing leakage current at 
1 µA/µm, the quiescent supply current would be 50A. Clearly 
this is unacceptable. There are several proposals for dealing 
with this problem and I will briefly discuss some of them in 
this section. At this time we do not know the best way to 
deal with this problem. 
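
The quiescent current estimate is a single multiplication, sketched here so that other leakage levels can be tried. The function name and the loop values other than 1 µA/µm are illustrative assumptions.

```python
def quiescent_current_a(total_width_um, ioff_a_per_um):
    """Standby supply current (A) for a given total leaking device width."""
    return total_width_um * ioff_a_per_um


width_um = 50e6                                   # 50 million micrometers
i_standby = quiescent_current_a(width_um, 1e-6)   # 1 uA/um leakage
print(i_standby)

# Each order of magnitude removed from the leakage per micrometer removes
# one order of magnitude from the standby current.
for ioff in (1e-6, 1e-7, 1e-8, 1e-9):
    print(ioff, quiescent_current_a(width_um, ioff))
```

At 1 µA/µm the result is about 50 A, matching the figure above.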

One obvious solution to control quiescent power 
consumption is to put almost all the circuit in power-down mode at 
any instant and activate only those blocks that are being 
accessed. This system-level type of solution is beyond the 
scope of this paper and needs to be evaluated by the design 
community. 

Another possible solution that has been proposed involves 
multiple threshold devices in the same technology. For 
example, the 0.1-µm generation could offer FETs with Vt = 
0.2V and Vt = 0.4V. The low-Vt FETs could be used for 
speed-critical paths and the higher-Vt FETs could be used for tasks 
for which speed is not as important. 

After modifying the doping profiles in TSUPREM-4 to get 
higher thresholds, the MEDICI simulations were repeated 
and new SPICE models extracted. Fig. 12 shows the resulting 
drive current and off-state current for various values of Vt in 
the 0.1-µm generation. Fig. 13 shows the gate delay as a 
function of Vt. From these graphs, FETs with Vt = 0.4V 
would yield gate delays about 80% longer than Vt = 0.2V but 
with off-state currents reduced by nearly three orders of 
magnitude. Again, the off-state currents shown are for 
nominal devices and worst-case would be higher. This approach 
is conceptually easy to implement in any technology. 
However, it increases the complexity of both the process and the 
circuit design. 

Fully depleted (FD) silicon-on-insulator (SOI) devices have 
been proposed to reduce off-state current for a given Vt. 
These devices have a steeper subthreshold slope than 
conventional bulk devices, thus reducing off-state current 
without increasing Vt. However, single-gate FD SOI devices are 
difficult to scale into the deep submicrometer regime. 
Dual-gate FD SOI devices scale much better but are very 
complicated to make. These difficulties, coupled with the material 
quality and availability issues, make the FD SOI device 
an unlikely candidate for future generations of high-speed 
digital technology. 

Fig. 13. Inverter delay versus threshold voltage Vt for the 0.1-µm 
generation. 

If no other solution for high Ioff can be found, then Vt cannot 
be scaled lower than a certain point. For example, if one 
needed to keep Ioff (nominal) at 1 nA/µm, then Vt (nominal) 
could not go below about 0.35V. We can apply this to the 
0.1-µm generation (Tox = 2.5 nm) and resimulate the device 
with Vt = 0.35V. After compact model extraction and inverter 
simulations, we find that Vdd must be increased to 1.8V to 
get the same performance as shown in Fig. 10 for the 0.1-µm 
generation. At Vdd = 1.8V and Vt = 0.35V, the device 
simulations predict a drive current of slightly over 1 mA/µm 
(NMOS). The peak oxide field would be over 7 MV/cm and 
the peak electron temperature would be about :Wt)0K at 
Vd = Vg = 1.8V (compare to Fig. 7). Even if we could obtain 
this very high drive current, it is questionable whether such 
a device could be created with adequate reliability. In any 
case, it is clear from this discussion that ceasing threshold 
voltage scaling would have a crucial impact on future device 
technologies. 

Conclusions 

We have explored MOSFET scaling into the future, 
extrapolating past scaling trends in channel length and gate oxide 
thickness. This scaling requires ever-shallower junction 
profiles and, below Lg = 0.15 µm, retrograde channel profiles. 
Constant-field scaling applied to Vdd and Vt continues the 
historical trend of about 30% improvement in gate delay per 
generation. In this era, MOSFET drive current remains nearly 
constant from one generation to the next and most of the 
performance improvement comes from the decreasing 
supply voltage. However, this performance comes at the price 
of exponentially increasing off-state leakage. Possible 
solutions to this problem were discussed briefly but no clear 
resolution is available at this time. Clearly, this is an area 
where the design and technology communities must work 
together to develop an optimal roadmap for future device 
scaling. 

Frequency Modulation of System 
Clocks for EMI Reduction 



This paper focuses on clock dithering as an on-chip technique for EMI 
reduction. It is a survey paper based on information gathered from inside 
and outside HP's Integrated Circuit Business Division (ICBD). It reviews the 
basic concept, the work that has been done at ICBD and elsewhere, ICBD 
customer experiences, and lessons drawn from these experiences about 
design, effectiveness, and customer implementation with ICBD. 

by Cornelis D. Hoekstra 



The proliferation of electronic products in the home and 
office is putting increasing pressure on every product to 
reduce its electromagnetic interference (EMI). At HP's 
Integrated Circuit Business Division (ICBD), several different 
methods have been used to help deal with EMI directly 
on-chip, among them frequency modulation of the system clock, 
also called clock dithering, and control of pad output rise 
and fall times over process, voltage, and temperature (PVT) 
variations. This latter method is also called adjustable 
output pad (AOP) control, and sometimes includes 
programmable adjustment for capacitive loading. 

This paper focuses on clock dithering as an on-chip 
technique for EMI reduction. It reviews the basic concept, the 
work that has been done at several different ICBD design 
centers and elsewhere, ICBD customer experiences with 
that work, and lessons drawn from these experiences about 
design, effectiveness, and customer implementation with 
ICBD. The paper closes with a brief review of the costs and 
benefits of implementing dithering and a summary of what 
customers can expect when working with ICBD. This paper 
does not aim to be a comprehensive description of dithering 
circuitry and mathematics, but rather a more narrative 
description of experiences and rules of thumb. See reference 
1 for a more detailed discussion of circuit implementation. 

Typical Clock Dithering Circuit 

The key idea, illustrated by Fig. 1, is the control of the 
frequency of the voltage-controlled oscillator (VCO) of a 
phase-locked loop by appropriate division of the reference clock 
(RefClk) by the input divider (Q) and of the VCO clock (fvco) 
by the feedback divider (P). The dividers all consist of digital 
counters. The divided digital waveforms are compared by 
the phase-frequency detector, which puts out an up or down 
signal pulse depending on whether the P waveform lags or 
leads the Q waveform. The width of the up or down pulse is 
proportional to the amount of lag or lead. The up or down 
pulse is fed to the charge pump and low-pass filter block, 
which translates it to a change in the VCO control voltage 
(vcntl). The VCO control voltage is repeatedly adjusted by up 
or down pulses until the VCO frequency fvco is such that the 
P and Q waveforms align (i.e., the up and down pulses are of 
nearly zero width). At this point RefClk/Q = fvco/P. Since the 
VCO frequency is divided by the output divider D, the actual 
output signal PllClk = fvco/D. Thus, the output frequency at 
stable operation is dictated by the values of the Q, P, and D 
dividers, and by appropriate substitution can be written as 
PllClk = P(RefClk)/(QD). Thus, for example, if RefClk = 10 MHz, 
P = 50, Q = 10, and D = 5, the output frequency PllClk is the 
same as the input frequency, 10 MHz. 

If the P counter endpoint is 49, the output frequency is 9.8 MHz, 2%
less than 10 MHz. Therefore, a simple way to achieve dithering is to
change the P counter endpoint back and forth between 50 and 49 at some
reasonable rate. Controlling this rate is the job of the M counter, that
is, the P counter endpoint is changed each time the modulation counter
(M) reaches its endpoint. A typical value for M might be 10, so that the
modulation frequency is then 10 MHz/(QM) = 100 kHz. In practice, either
the Q counter, the P counter, or both can be changed to achieve
different target frequencies. Furthermore, by using both the rising and
falling edges of the VCO clock, P can effectively have values such as
50.5 and 49.5, thus allowing a symmetric deviation of ±1% about a center
value of 50.
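
The divider arithmetic above can be checked with a few lines of Python.
This is only a sketch of the relations PllClk = P(RefClk)/(QD) and
modulation frequency = RefClk/(QM) using the example values from the
text; the function and variable names are illustrative and do not come
from any ICBD design.

```python
from fractions import Fraction

def pll_output_hz(ref_clk_hz, p, q, d):
    """Locked output frequency: PllClk = P * RefClk / (Q * D)."""
    return Fraction(p) * ref_clk_hz / (q * d)

def modulation_hz(ref_clk_hz, q, m):
    """Modulation frequency as given in the text: RefClk / (Q * M)."""
    return Fraction(ref_clk_hz, q * m)

REF_CLK_HZ = 10_000_000  # 10-MHz reference from the worked example
Q, D, M = 10, 5, 10

nominal = pll_output_hz(REF_CLK_HZ, 50, Q, D)
dithered = pll_output_hz(REF_CLK_HZ, 49, Q, D)
# Half-integer P values model the use of both VCO clock edges.
upper = pll_output_hz(REF_CLK_HZ, Fraction(101, 2), Q, D)
lower = pll_output_hz(REF_CLK_HZ, Fraction(99, 2), Q, D)

for label, f in [("P = 50", nominal), ("P = 49", dithered),
                 ("P = 50.5", upper), ("P = 49.5", lower)]:
    offset_pct = float((f - nominal) / nominal * 100)
    print(f"{label}: {float(f) / 1e6:.2f} MHz ({offset_pct:+.1f}%)")
print(f"modulation: {float(modulation_hz(REF_CLK_HZ, Q, M)) / 1e3:.0f} kHz")
```

Exact fractions are used so that the half-integer P values do not pick
up floating-point rounding.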

The scheme described above can be thought of as square wave modulation
because the phase-locked loop is asked to jump instantaneously from one
frequency to another. Because real systems don't respond that way, and
because of deliberate filtering to moderate this sudden transition, the
actual frequency modulation waveform looks more like a ringing square
wave. Typical simulation results for startup, lock, and modulation for
this design, using the SABER analog/digital mixed-signal simulation
tool, are shown in Figs. 2a and 2b. The VCO control voltage vcntl
represents frequency, up and down are as described above, and p_sel is
the output of the modulation divider (M). Frequency deviation of the
dithered clock is typically ±1% to ±2%, and modulation frequency is
typically 50 kHz to 250 kHz. Cycle-to-cycle jitter has ranged from well
under 0.5% to as much as 2% for designs to date.

Fig. 1. Block diagram of a typical dithering phase-locked loop.

Square wave modulation as just described has been used successfully in a
number of products to reduce EMI emission sufficiently to allow products
to pass FCC testing when they otherwise could not. However, in some
applications, the cycle-to-cycle jitter associated with this modulation
method cannot be tolerated by the system (this is discussed further
below). Over the last year this drawback has been addressed at ICBD by
the development of triangle wave modulation. This method uses
delta-sigma methods to step the phase-locked loop frequency more
gradually from a low-frequency target to a high-frequency target and
back again, resulting in what is usually called triangle wave frequency
modulation. The technique greatly reduces cycle-to-cycle jitter of the
phase-locked loop output clock compared to square wave frequency
modulation. It also provides some improvement in EMI reduction because
of flatter spectral response between the upper and lower frequency
targets. This new phase-locked loop design has been successfully
implemented in ICBD's CMOS14TB process on two ASICs for HP products. The
new modulation method is less sensitive to process variation than
previous methods, and should therefore be easy to port to future process
generations and second-source fabrication facilities. Fig. 3 shows a
closeup of the triangle-wave modulation waveform of this new design
(startup is similar to square wave modulation).
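
To see why stepping the frequency gradually reduces cycle-to-cycle
jitter, the following sketch compares idealized square wave and triangle
wave modulation of a 10-MHz clock with ±1% deviation and 100-kHz
modulation (50 output cycles per half modulation period). It ignores
loop dynamics, ringing, and the delta-sigma circuitry entirely; the
function names and the linear frequency ramp are assumptions made only
for illustration.

```python
def square_wave_periods(f_center_hz, deviation, cycles_per_half):
    """Ideal square wave FM: dwell at the low target, then jump to the high one."""
    f_lo = f_center_hz * (1 - deviation)
    f_hi = f_center_hz * (1 + deviation)
    return [1 / f_lo] * cycles_per_half + [1 / f_hi] * cycles_per_half

def triangle_wave_periods(f_center_hz, deviation, cycles_per_half):
    """Ideal triangle wave FM: step linearly from low to high and back."""
    f_lo = f_center_hz * (1 - deviation)
    step = 2 * deviation * f_center_hz / cycles_per_half
    up = [f_lo + i * step for i in range(cycles_per_half)]
    down = [f_lo + (cycles_per_half - i) * step for i in range(cycles_per_half)]
    return [1 / f for f in up + down]

def worst_cycle_to_cycle(periods, nominal_period):
    """Largest period change between adjacent cycles, relative to nominal."""
    wrapped = periods + periods[:1]  # include the wrap into the next modulation cycle
    return max(abs(b - a) for a, b in zip(wrapped, wrapped[1:])) / nominal_period

F_CENTER_HZ = 10e6
DEVIATION = 0.01
CYCLES_PER_HALF = 50

nominal_period = 1 / F_CENTER_HZ
square_jitter = worst_cycle_to_cycle(
    square_wave_periods(F_CENTER_HZ, DEVIATION, CYCLES_PER_HALF), nominal_period)
triangle_jitter = worst_cycle_to_cycle(
    triangle_wave_periods(F_CENTER_HZ, DEVIATION, CYCLES_PER_HALF), nominal_period)

print(f"square wave:   {square_jitter:.2%} worst cycle-to-cycle change")
print(f"triangle wave: {triangle_jitter:.3%} worst cycle-to-cycle change")
```

In this idealized model the square wave concentrates the whole frequency
change into a single cycle boundary, while the triangle wave divides it
across every cycle of the ramp.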

HP Experiences with Dithering 

ICBD Customer Divisions. A number of HP products have used clock
dithering to date. For one product, two different modulation schemes
were designed and manufactured by two independent organizations using
different processes. For one design and process the modulation waveform
looked like a ringing square wave that substantially overshot the target
frequencies, while for the other design and process the modulation
waveform was more triangular because of its use of a smaller-bandwidth
filter. The advantage of the triangular version was that changes in
period from cycle to cycle were gradual (less cycle-to-cycle period
variation or jitter), and the spectrum was smoothly spread across many
frequencies between the minimum and maximum. However, because the
narrow-bandwidth filter loop response was so slow, for worst-case slow
conditions the VCO frequency never reached its target value, limiting
total frequency deviation and thus EMI reduction. In the ringing square
wave version, the loop response was very fast, so that target
frequencies were reached and even exceeded because of overshoot, over
all process conditions. For this square wave scheme, the frequency was
distributed over a wider range, although less evenly. Fig. 4 shows
simulated VCO control voltage waveforms for these two different designs
for qualitative comparison.

Conducted EMI measurements of the frequency spectrum of the system clock
pin showed lower peak values for the square-wave-modulated part than for
the triangular-wave-modulated part, which appeared to be a result of
greater spectrum spreading because of square wave overshoot. However,
radiated emissions from the boards using parts designed with triangular
wave modulation exhibited less noise overall. Although the reason for
this was not proved, it appeared to be the result of slower overall
switching speeds of the process used to manufacture this version of the
design. Nevertheless, for parts from both processes, dithering had a
beneficial effect on EMI. Fig. 5 shows conducted frequency spectra for
one of the harmonics of the dithered clock measured on the clock pin of
each part.

For the product described above, the EMI reduction observed for square
wave modulation did not seem to match the reduction predicted
mathematically by standard FM theory based on the deviation and the
modulation rate. Therefore, for another product, a dithering
phase-locked loop using square wave modulation was made programmable to
a number of different deviation and modulation values to make it
possible to explore EMI reduction based on these two parameters. This
gave the interesting result that EMI reduction was optimum somewhere
between very slow and very fast modulation, contrary to standard FM
theory. The reason for this is not really very surprising and is
discussed further below (see "Design Considerations").

The complex relations described above between overall EMI reduction and
modulation waveshape, process characteristics (e.g., intrinsic switching
speeds), measurement method (e.g., conducted versus radiated spectra),
and modulation rate caused substantial confusion and disagreement about
which modulation method was better for EMI reduction, and was a primary
stimulus for writing this paper.

Fig. 5. Conducted spectrum of a dithered clock harmonic for (a) a square
wave modulation design and (b) a triangle wave modulation design.
Spectra were filtered through a CISPR 16-compliant quasipeak detector
(see "EMI Measurement Standards" on page 105).

Other HP Divisions. Another HP division has taken a different approach.
This division developed an all-digital gate-array part that receives a
40-MHz input reference signal and outputs a clock that varies
pseudorandomly near 14 MHz. This pseudorandom 14-MHz signal is then fed
to an IC containing a phase-locked loop, which smooths the pseudorandom
14-MHz signal and thus effectively generates a dithered system clock.
The pseudorandom 14-MHz signal is generated in a deterministic way that
makes it exactly 14.31818 MHz on average, so that it can be used as a
real-time clock. Modulation is synchronized to the horizontal sync
signal of the video display so that no random jitter is observed in the
video picture. Fig. 6 illustrates the appearance of the pseudorandom
14-MHz clock compared to the 40-MHz input clock and the ideal
14.31818-MHz clock. Two modulation cycles are shown.
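
The article does not describe how the gate array produces its output,
so the following is only a hypothetical illustration of how a clock
built from whole 40-MHz input cycles can vary from cycle to cycle yet
average to an exact frequency. It assumes the target is the exact ratio
315/22 MHz (14.318181... MHz), that each output period is either 2 or 3
input cycles long, and that a fixed-seed shuffle stands in for the
pseudorandom ordering. None of these details are taken from the actual
part.

```python
import random
from fractions import Fraction

INPUT_HZ = Fraction(40_000_000)
TARGET_HZ = Fraction(315_000_000, 22)  # assumed exact value of "14.31818 MHz"

def period_pattern(seed=1):
    """One repeating block of output periods, in units of input-clock cycles."""
    ratio = INPUT_HZ / TARGET_HZ                  # input cycles per output cycle
    total_in, total_out = ratio.numerator, ratio.denominator
    n_long = total_in - 2 * total_out             # periods lasting 3 input cycles
    n_short = total_out - n_long                  # periods lasting 2 input cycles
    pattern = [3] * n_long + [2] * n_short
    random.Random(seed).shuffle(pattern)          # fixed seed: repeatable ordering
    return pattern

def average_output_hz(pattern):
    """Average output frequency over one complete block."""
    return INPUT_HZ * len(pattern) / sum(pattern)

pattern = period_pattern()
print(f"{len(pattern)} output cycles span {sum(pattern)} input cycles")
print(f"average output: {float(average_output_hz(pattern)) / 1e6:.5f} MHz")
```

Because the block always contains the same mix of short and long
periods, the average is exact no matter how the periods are ordered
within the block.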

Non-HP Clock Dithering Products. A standalone product that provides a
clock whose frequency varies very smoothly over various ranges of center
frequency and deviation is available in the industry. However, the
product is expensive, takes up board space, and requires additional
surface mount parts for operation, further adding to board costs. In
addition, the modulation frequency is fixed and cannot be synchronized
with product operation, such as the horizontal sync signal of a video
display to prevent visual distortion due to jitter.

The recently developed ICBD dithering phase-locked loops described above
offer smoothly varying frequency modulation without these disadvantages,
and at very low cost.

Design Considerations 

EMI Reduction versus Modulation Waveform. A square wave can be described
as a linear superposition of the odd harmonics of a fundamental sinusoid
whose frequency is equal to the frequency of the square wave. Thus, a
lot can be understood about square wave modulation of a square wave by
considering square wave modulation of a sine wave. The discussion below
is given with this in mind.

FM theory predicts that the power of a sine wave whose frequency is
modulated by another sine wave is distributed across individual small
peaks between the minimum and maximum frequency endpoints, separated by
a frequency difference equal to the modulation frequency. Thus, as
modulation frequency is decreased, there are more power peaks, but with
lower peak values and spaced closer together. On the other hand, the
spectrum of a sine wave whose frequency is modulated by a square wave
contains just two peaks, regardless of how slowly the sine wave is
modulated. These peaks are at the minimum and maximum frequency
deviation points, and each contains half the total power of the
unmodulated sine wave. This can be intuitively understood by realizing
that the modulated signal spends virtually all of its time stabilized at
one or the other of the two frequency endpoints.
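
The sine-modulated case can be illustrated numerically. In standard FM
theory the spectral line at n times the modulation frequency from the
carrier has relative amplitude |J_n(beta)|, where J_n is the Bessel
function of the first kind and beta is the ratio of frequency deviation
to modulation frequency. The sketch below evaluates J_n from its power
series and shows the largest line shrinking as the modulation frequency
drops. The 100-kHz deviation and the 0.1 threshold for counting lines
are arbitrary choices for illustration, not values from the article.

```python
import math

def bessel_j(n, x, terms=60):
    """Bessel function of the first kind J_n(x), integer n >= 0, by power series."""
    return sum(
        (-1) ** k / (math.factorial(k) * math.factorial(n + k)) * (x / 2) ** (2 * k + n)
        for k in range(terms)
    )

def fm_lines(deviation_hz, f_mod_hz):
    """Modulation index and amplitudes of the carrier and one side of the sidebands."""
    beta = deviation_hz / f_mod_hz
    n_max = int(beta) + 20  # well beyond the significant sidebands
    return beta, [abs(bessel_j(n, beta)) for n in range(n_max + 1)]

DEVIATION_HZ = 100e3  # for example, 1% of a 10-MHz clock

for f_mod_hz in (100e3, 50e3, 20e3, 10e3):
    beta, amps = fm_lines(DEVIATION_HZ, f_mod_hz)
    lines = (1 if amps[0] > 0.1 else 0) + 2 * sum(a > 0.1 for a in amps[1:])
    power = amps[0] ** 2 + 2 * sum(a * a for a in amps[1:])
    print(f"f_mod = {f_mod_hz / 1e3:5.0f} kHz  beta = {beta:4.1f}  "
          f"largest line = {max(amps):.3f}  lines above 0.1 = {lines}  "
          f"total power = {power:.3f}")
```

The total power stays at 1 in every case; lowering the modulation
frequency only redistributes it over more, smaller lines.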

As the modulation rate is increased, a real system designed to do square
wave modulation cannot actually respond instantaneously in true square
wave fashion and spends relatively more time in transition between
frequencies and less time stabilized at its frequency extremes. Thus,
the system starts to look more like a sinusoidally modulated system, and
the two square wave power peaks tend to distribute into multiple smaller
peaks. Finally, as the modulation rate is increased even further, the
number of peaks will tend to decline again and their individual peak
values will increase, in accordance with FM theory as discussed above.

In other words, in a real system designed to do square wave modulation
there is a point of maximum EMI reduction between very fast and very
slow modulation rates. The exact location of this point varies depending
on phase-locked loop and product characteristics. This phenomenon was
verified for the product with programmable parameters described above.
For this product's particular phase-locked loop and product
characteristics, measurements showed that the greatest reduction
occurred at a point where the ratio of frequency deviation to modulation
frequency was about 1.4.

Fig. 6. Pseudorandom clock with average frequency of 14.31818 MHz,
digitally generated from 40 MHz. Two modulation cycles are shown.

As discussed near the beginning of this article, aside from reducing
cycle-to-cycle jitter, the recent development of triangle wave
modulation at ICBD also improves on the EMI reduction limitations of
square wave modulation just described by more smoothly spreading
spectral energy across the entire range of frequencies between the
maximum and minimum endpoints at low modulation rates.

Programmability. It is valuable to provide modulation, deviation, and
dithering on/off programmability in the phase-locked loop. These
features should be easy to control during both phase-locked loop test
mode and normal operation to allow rapid and effective evaluation of
silicon and to optimize the product's EMI characteristics. This kind of
characterization adds knowledge to the phase-locked loop database, and
with appropriate programmability can also be used to assess product
margin over a range of fixed operating frequencies.

Frequency Synthesis. Frequency synthesis is a built-in option for
dithering phase-locked loops. The same basic method used to create
frequencies slightly smaller or larger than the reference can be used to
synthesize nearly any arbitrary frequency within phase-locked loop
performance limitations. This allows the use of lower-frequency crystals
(<20 MHz or so) operating in fundamental mode to generate the frequency
reference. These crystals are typically less expensive, require fewer
extra components, and cause fewer startup problems than higher-frequency
crystals, which need to operate in overtone modes.

Spectrum Overlap. When deciding on deviation values, the designer should
keep in mind the potential for spectrum overlap at higher harmonics.
This can occur when the output frequency is relatively low and the
frequency deviation is relatively high, and can tend to cancel out the
expected EMI reduction for high-frequency components.
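
One way to estimate where overlap begins is sketched below. It assumes
that "overlap" means the spread band of harmonic n, which runs from
n f0 (1 - d) to n f0 (1 + d) for clock frequency f0 and fractional
deviation d, reaches the band of harmonic n + 1. That reading and the
function names are assumptions made for illustration; overshoot and
sideband energy outside the nominal deviation are ignored.

```python
import math
from fractions import Fraction

def first_overlapping_harmonic(deviation):
    """Smallest n with n * (1 + d) >= (n + 1) * (1 - d), i.e. n >= (1 - d) / (2 d)."""
    return math.ceil((1 - deviation) / (2 * deviation))

def overlap_frequency_hz(f_clock_hz, deviation):
    """Approximate frequency at which adjacent harmonic bands first meet."""
    return first_overlapping_harmonic(deviation) * f_clock_hz

cases = [
    (10_000_000, Fraction(1, 100)),
    (10_000_000, Fraction(2, 100)),
    (40_000_000, Fraction(1, 100)),
]
for f_clock_hz, deviation in cases:
    n = first_overlapping_harmonic(deviation)
    f_overlap = overlap_frequency_hz(f_clock_hz, deviation)
    print(f"f0 = {f_clock_hz / 1e6:.0f} MHz, deviation = {float(deviation):.0%}: "
          f"bands meet at harmonic {n} (about {f_overlap / 1e6:.0f} MHz)")
```

Under these assumptions the harmonic number depends only on the
deviation, so a lower clock frequency or a larger deviation brings the
overlap point down to a lower absolute frequency.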

Mixing Dithered and Nondithered Clocks. Dithered and nondithered clock
domains on the same chip usually must be treated as unrelated clock
domains, and therefore should be avoided if possible. Consider the case
of two clock domains, one dithered and one undithered. Given a point
where the rising edges of the two clocks go up simultaneously, the edges
of the dithered clock that follow will alternately lead or lag the
corresponding edges of the reference clock over time. This is sometimes
referred to as clock slip. Clock slip is both difficult to control
accurately and difficult to measure accurately, particularly in a
production environment. Designing a system with asynchronous domains is
typically messy, complicated, and hard to simulate, so it should be
avoided if possible. For example, in the case of video clocks, rather
than have a nondithered clock to prevent visual jitter in the video,
modulation can be synchronized to the horizontal sync signal. On the
other hand, if separate clock domains are necessary, they can be used;
they just require more careful engineering.

System Simulation. We recommend that our customers simulate a
behavioral/structural Verilog model in their chip designs to catch
unexpected problems. Examples of problem areas exposed when simulating
with such a model are improper multiplexing and I/O control, inadequate
pad drive strength for fast (system clock speed) output test signals,
asynchronous interfaces, and marginal performance with respect to
operating frequency.

A mixed-signal (i.e., analog and digital) simulation tool is
indispensable for phase-locked loop design and simulation. The tool
should have links to Verilog and C, and ideally should offer built-in
FFT analysis capability, which can be useful for evaluation of the
spectral characteristics of various dithering alternatives.

Multiple Phase-Locked Loops. Careful attention must be paid to systems
with multiple phase-locked loops. Some major system IC components (e.g.,
microprocessors) contain phase-locked loops of their own, used for
things such as frequency synthesis and generation of nonoverlapping
clocks. These phase-locked loops must be able to track the dithering of
the reference phase-locked loop's output adequately so that clock skew
does not become excessive across the system. Unfortunately, assessing
this phase-locked loop tracking ability in simulation can be difficult.
A dithering phase-locked loop with very low cycle-to-cycle jitter (i.e.,
very slowly changing frequency) can help avoid the need for this
simulation, and is a major reason for ICBD's development of triangular
modulation.

Product Evaluation 

Silicon Process Variation. Process speed can be an important factor
affecting the apparent effectiveness of dithering. For current designs,
process speed has a fairly strong effect on VCO gain as well as clock
tree and output driver switching speed. Thus, we recommend that our
customers make a point of looking at both fast and slow parts when
evaluating the effectiveness of dithering.

EMI Measurement Standards. The CISPR 16 EMI measurement standard is not
absolute, and different measurement tools may all meet the standard yet
give different results. The method by which EMI emissions are to be
measured is defined by a standard called CISPR 16-1. This standard is
intended to approximate the characteristics of typical radio-frequency
receivers. For frequencies from 30 MHz to 1 GHz, power is averaged for a
passband whose nominal width is 120 kHz at 6 dB. However, the standard
allows passbands ranging from 100 kHz to 140 kHz at 6 dB, so results
vary depending on the measurement filter chosen. Peak values that change
at a rate faster than 10 to 20 kHz are ignored by using a quasipeak
detector defined by the standard. The time-constant characteristics of
the quasipeak detector are again given as a range of allowable values.
This ambiguity in both filter and peak detector characteristics means
one should be careful when comparing EMI measurements from different
tests.

Conducted versus Radiated Spectra. It is important to differentiate
between conducted and radiated spectra. When a probe is directly touched
to the clock pin of a part, the conducted spectrum observed is a fairly
direct representation of the spectral composition of the clock signal.
However, when electromagnetic emissions are monitored at a distance from
a finished product, the clock signal has been significantly "filtered"
by the antenna characteristics of the product and the measurement
environment. In other words, products act as frequency selective
antennas, so conducted and radiated spectra can be and usually are quite
different. This increases the desirability of phase-locked loop
programmability to find optimum performance. Consequently, this also
makes ease of controllability and access for measurement important.

Effectiveness as a Function of Frequency. Dithering is intrinsically
more effective at higher harmonics and less effective at lower
harmonics. This is simply because the absolute value of frequency
deviation increases linearly with harmonic number, so that spectral
energy is spread over a larger range at higher harmonics, while the
width of the filter over which spectral energy is measured is fixed.
Fortunately, high frequency is exactly where certain customers have
their most severe problems. For example, one HP printer division tends
to have many components on the printer board, and relies heavily on
shielding to limit emissions. Low-frequency noise seems to be
effectively contained, but high-frequency (short wavelength) noise tends
to leak out through openings in the shielded box. For another HP printer
division, on the other hand, low-frequency radiation turns out to be the
primary noise source, at least in part because of unique resonant
conditions created by printer cabling, with the result that dithering is
less effective.
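
The scaling with harmonic number can be made concrete with a deliberately
crude estimate. The sketch below assumes the energy of harmonic n is
spread uniformly over a band 2 n d f0 wide (clock frequency f0,
fractional deviation d) and that the receiver simply captures whatever
falls inside a 120-kHz passband. It ignores the quasipeak detector, the
discrete line structure of the spectrum, and antenna effects, so the
numbers indicate a trend only and are not predictions of measured
results.

```python
import math

RECEIVER_BW_HZ = 120e3  # nominal measurement passband quoted in the text

def spread_width_hz(f_clock_hz, deviation, harmonic):
    """Width of the band over which harmonic n is spread: 2 * n * d * f0."""
    return 2 * deviation * f_clock_hz * harmonic

def rough_reduction_db(f_clock_hz, deviation, harmonic, bw_hz=RECEIVER_BW_HZ):
    """Reduction in captured power if energy were spread uniformly (crude model)."""
    spread = spread_width_hz(f_clock_hz, deviation, harmonic)
    if spread <= bw_hz:
        return 0.0  # the whole spread band still fits inside the passband
    return 10 * math.log10(spread / bw_hz)

F_CLOCK_HZ = 10e6
DEVIATION = 0.01

for harmonic in (1, 3, 10, 30, 100):
    width = spread_width_hz(F_CLOCK_HZ, DEVIATION, harmonic)
    reduction = rough_reduction_db(F_CLOCK_HZ, DEVIATION, harmonic)
    print(f"harmonic {harmonic:3d} ({harmonic * F_CLOCK_HZ / 1e6:5.0f} MHz): "
          f"spread {width / 1e3:7.0f} kHz, rough reduction {reduction:4.1f} dB")
```

Each tenfold increase in harmonic number adds 10 dB in this model, which
reflects the fixed filter width described above.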

Dithering versus PVT/AOP. The PVT or AOP technique (defined at the
beginning of this article) consists of controlling the turn-on times or
the rise and fall times of IC output drivers (pads), ideally keeping
these times constant over PVT variations, and sometimes adjusting for
capacitive load as well. This method is also intended to keep ground
bounce and signal reflections constant over PVT variations. It requires
circuitry to monitor the PVT operating point of the IC (for example, by
counting cycles of a free-running ring oscillator with respect to the
reference clock), and then adjusting the driver or predriver current to
control the drive strength or turn-on time of the driver, respectively.
This technique requires considerable effort for pad design, customer
simulation, and characterization of first silicon to accurately
correlate the PVT reference to pad programming. With the practice of
second-sourcing becoming more common, much of this work has to be
repeated for each foundry. Unfortunately, EMI reduction has been
negligible, at least as observed for HP DeskJet printers, a major user
of this method. For these products, system clock noise has turned out to
be a much greater source of EMI than pad switching noise, and system
clock noise is not helped by this method. In fact, the low output
resistance of typical PVT pads may encourage the transmission of system
clock noise from the power net out through the output drivers and onto
the product board. In summary, PVT or AOP is still believed to have
potential benefit, especially for very fast-switching outputs, but is
not likely to realize its potential until drive control is fairly
automatic inside each pad, without requiring chip-level programming
intervention.

Dithering addresses most of the limitations of PVT. The system clock
tends to be the top noise source because by design everything happens at
rising and falling edges of the clock. This means that clock dithering
will tend to spread out all sources of noise throughout the system,
since all of them are related to the clock. Thus, the advantage of
dithering circuitry is that it is essentially a single independent block
that can be inserted into an IC design to help reduce EMI globally, at
both chip and board levels.

Verifying Testability and Compatibility. Customers need to set aside
engineering time to design their IC to make the phase-locked loop
accessible for testing and to ensure that the IC will work with a
dithered clock. In production testing the phase-locked loop internals
are typically tested while the IC is in a special phase-locked loop test
mode. During this mode certain pins of the IC are multiplexed to the
phase-locked loop block's input and output ports for direct access on a
production tester. Special test decks written by ICBD are then applied
to the part. The customer needs to design the IC to accommodate this
phase-locked loop test mode configuration. The customer also is advised
to simulate the IC design at the extremes of frequency expected from the
dithered clock, and to include uncertainty in the clock edge to account
for cycle-to-cycle jitter. Finally, recall the earlier recommendation
that if dithering is used it should be applied to the entire clock
domain. If that is not possible, customer effort will be required to
design asynchronous interfaces that do not rely on controlled phase
relations between the dithered and nondithered clock domains.

Customer Evaluation. At this point in the evolution of clock dithering,
customers should plan to spend some extra time beyond the usual EMI
characterization of their product to characterize their systems with
dithering. As experience is gained both by customers and ICBD, this need
should decline.

Fully Synthesizable Microprocessor
Core via HDL Porting

Microprocessors integrated in superchips have traditionally been ported 
from third-party processor vendors via artwork. A new methodology uses 
hardware description language (HDL) instead of artwork. Having the HDL 
source allows the processor design to be optimized for HP's process in 
much the same way as other top-down designs. 

by Jim J. Lin 



The level of integration has been rapidly increasing with advances in
semiconductor technology. Many HP design groups have capitalized on this
capability to create highly integrated ASICs, or superchips. Superchips
that integrate conventional ASIC logic, microprocessors, embedded RAM
and ROM, and other megacell functions save cost, power, board space, and
inventory overhead and increase I/O performance at the same time. This
industry-wide trend has put an increased burden on ASIC suppliers to
come up with megacells that are off-the-shelf, proven, and testable. As
an ASIC supplier to other HP divisions, we at HP's Integrated Circuit
Business Division (ICBD) licensed several early microprocessors that
customers demanded and artwork ported them into our process. All
superchips built at ICBD today contain artwork-ported microprocessors.

However, artwork porting has its limitations. It often does not yield
the best area possible in a given technology. In addition, the process
technologies of processor vendors tend to diverge from ICBD's technology
for the next generation. The testability of these processors is also a
problem in a superchip because they require access to their functional
pins to run parallel vectors. Multiplexing the processor pins with ASIC
pins is a level of complication preferably avoided. Customers also
demand some controllability of the exact microprocessor configuration
that goes into the superchip. For example, customers may choose to
increase or decrease the size of the cache that's included with the
microprocessor after profiling target code under different cache
configurations. Making such changes at the artwork level is a major
undertaking and often requires a lot of work for the processor vendor as
well. Presilicon verification is also virtually nonexistent and the
design often requires several mask turns. Finally, the processor is
technology dependent and requires almost as much effort to go into a
different process as the original port.

One methodology that successfully addresses these issues is HDL
(Hardware Description Language) synthesis. A number of underlying
technologies make this methodology work. The increased density in our
standard cell technology allows the implementation of dense data path
functions and is area-effective in general. ICBD's standard cell design
flow for top-down design methodology is robust and mature. Synthesis is
becoming more and more powerful and is capable of highly complex
designs. ICBD's RAM generation is also a key component that delivers
good performance for cache applications. Processor vendors in the
embedded market have shifted their design paradigms as well. They are no
longer custom designing everything and are using HDL and synthesis more
and more.

The methodology of porting cores using HDL synthesis incorporates an
existing standard tool flow. This is a major advantage. An efficient
tool flow is essential in today's ASIC market. Significant effort has
gone into making ICBD's as efficient as possible to handle the
high-integration market. The goal for this methodology is to leverage
the standard tool flow as much as possible. In essence, processor HDL is
transferred from the processor vendor. After necessary changes in
certain configurations at the HDL level, the HDL is verified through
functional simulation. It is then fed into the synthesis tool to obtain
a netlist that is subsequently fed into the conventional tool flow.

The rest of this paper will discuss in more detail the methodology and
how it was used to implement the ColdFire 5202 microprocessor from
Motorola in a test chip.

Methodology Overview

Since the processor cores developed from this methodology are going to
be used in superchips, they need to be designed with ease of integration
and customer needs in mind. Testability, customizability, technology
independence, minimum die size, and thorough presilicon verification are
all goals that are important to delivering a successful core for
superintegration.

The testability of a processor core has different constraints than a
standalone processor because the functional pins are not visible when
the core is integrated on a superchip. Pins of processors have
traditionally been multiplexed out in test mode, which takes additional
effort, and fault grading the functional vectors is not always easy. Our
new methodology uses full-scan test patterns that require only a few
scan ports that are needed anyway for other ASIC logic and can be
effectively fault graded to determine the quality of the vectors. This
approach to testing is compatible with HP testing standards and
minimizes the cost of testing.

Having the HDL for the processor means that changes to the processor can
be done at the source level rather than the artwork level. While changes
to the instruction set architecture are nontrivial and are not
recommended by the processor vendor, changes to the cache and bus
configurations are within the realm of possibility. Customers can often
save area by cutting out an unused block on the processor or reducing
the cache size. The functionality of the design can be verified before
silicon to bolster confidence for first-time success in the customized
processor.

Our methodology also enables technology independence through logic
remapping. The HDL can be simply recompiled to target a different
technology, assuming that the technology has a standard cell library
capable of synthesis. This is true for porting a processor to a new
technology at ICBD, a second source, or a dual source.

Area is a very important consideration as well. If the area of the
synthesized core is not competitive with custom-laid-out processors,
customers will not adopt this strategy regardless of how good the
methodology. Standard cell density has increased to the point where even
data path blocks like ALU, barrel shifter, and register files can be
implemented in a reasonable amount of area. Another flexibility in using
standard cells is that a processor core can be compiled with the desired
target frequency. If the nominal frequency is faster than the target,
cells can be sized down to save some additional area.

Having the HDL source for the processors also means that the design can
be simulated, both at the RTL (Register Transfer Language) level
initially and at the gate level at the end. Vectors can be run to verify
the functionality and timing of the design.

To ensure functionality, three different approaches can be used, either
alone or in combination. RTL and netlist verification can run
precaptured vectors from the processor vendor. These vectors are
diagnostics, benchmarks, or random instructions that processor vendors
themselves use. The netlist can also be compared with the vendor's
design using formal verification methods. Lastly, an environment in
which random instructions are generated can be set up locally to subject
the design to new random instruction testing.

Timing is verified with a combination of static and dynamic
timing analysis. An even greater advantage for the customer
is the ability for the entire superchip to be simulated in a
timing-accurate fashion since the processor is in the same
library as the ASIC logic. Previously, such system-level
simulation was only possible using a hardware modeler or a
software model that was not timing-accurate.

Design Flow 

An ICBD design automation group has a series of supported
design tool flows. A modified version of their Standard Tool
Flow 2 (STF2) is used to carry out our methodology. In this
way, our methodology leverages a proven methodology for
doing HDL-based design. Synthesizing microprocessor cores
becomes an extension of the current capability.

This paper focuses only on those aspects of the methodology
that are unique to porting microprocessor cores. The process
is illustrated in Fig. 1. The methodology incorporates
standard ICBD tool flows. For Verilog-based designs, the STF2 is
used. For VHDL (Very High-Speed Integrated Circuit Hardware
Description Language) designs, an HP proprietary
VHDL tool flow is used. These flows are simply encapsulated
as design processes in Fig. 1.

Inputs. The processor vendor needs to have the following list 
of items to feed into our tool flow: 

• Design Specifications. Timing, functionality, and pin
descriptions of the processor. Most of this information can be
obtained from a data book if available. For newer cores a
data book may not be available, but internal documentation
that will eventually be part of the data book will suffice.

• Behavioral HDL. A Verilog or VHDL model of the processor
core. This HDL model does not necessarily have to be
synthesizable. As long as it models the cycle-to-cycle behavior
of the design, it can be made synthesizable with some
rewriting of the HDL.

• Functional Vectors. Verification vectors in Verilog or VHDL
format. These are run on their respective simulators. They
can also be translated so that they can be run on the testers
as well. These vectors enable presilicon verification.

• Synthesis Scripts (optional). These are used to synthesize
the HDL into standard cells. They are only available for
cores that have been previously synthesized.

Outputs. For use in superchip integration and product
prototyping, the ICBD CIM team provides design groups and
customers with the following:

• Megacell. Processor core with all requirements for inclusion
in the HP intellectual property library. Deliverables include
ERS, gate-level netlist, behavioral model, data sheet, etc.

• Test Chip Data. Mask, packaging, test vectors, and test
program input for test chip fabrication and test. A core without
a test chip will not have this.

Rewrite HDL for Synthesis. The HDL that is transferred from
the processor vendor may not be synthesizable since it may
have been written only as a model, not as a synthesis source.
Current synthesis tools have limitations on the type of HDL
constructs allowed and yield very poor-quality circuits for
HDL not written with synthesis in mind. ICBD has a set of
HDL coding guidelines that need to be followed when
rewriting the HDL. This step may warrant some iterations in
synthesis to explore the optimal mapping of specific
behavioral constructs. Regardless of what changes are made to
the HDL or the quality of synthesis achieved, the changes
should not alter the functionality. In fact, any changes should
be thoroughly verified by running regression vectors as
discussed next.

Behavioral Verification. Even though behavioral verification
is a part of STF2, this portion of the flow focuses on the
extra verification that results from recoding for either
customization or synthesizability. The verification is an extensive
simulation of the altered HDL code with the vendor-provided
vectors. In the future, this step may be augmented by formal
verification. This subject will be revisited when the
verification of the test chip (discussed later) is analyzed.

Create and Modify Synthesis Scripts. The purpose of this task
is to create Synopsys Design Compiler scripts that can be
used to compile the standard cell portion of the core
consistently and systematically. As mentioned earlier, there may
not be existing synthesis scripts if the processor has never
been synthesized before. In other cases, there will be a full
suite of scripts along with design constraints. In still other
designs, portions of the control logic will have synthesis
scripts and the data path will not. This represents the design
methodology of many processor vendors.

In case synthesis scripts need to be created, design
specifications and the actual HDL are the best sources. ICBD has a
generic synthesis script that can be used as a template to
help create these scripts. Even when all of the scripts exist,
modifications are often needed to make them work in ICBD's
environment and libraries. The processor vendor's approach
to synthesis may not be the most efficient approach in ICBD's
technology. Furthermore, the vendor's approach may not
use the most up-to-date synthesis technique available in the
latest release of the synthesis tool. Many trial-and-error runs
of different approaches may be needed to determine the best
synthesis approach.

Create Quicktest Specification and PADLOC File. Quicktest
generates a test template given a target tester. The PADLOC
(pad location) file is an input to Quicktest and other tools
such as the router. This step is only needed if a test chip is
planned for the particular processor core. The need for a
test chip is determined on a core-by-core basis. The first
core in an architecture, a core with major customization,
and a customer need for prototyping are all reasons for a
test chip.

The Quicktest specification is used to generate a test
program for the processor core test chip. The Quicktest
specification documentation describes how to create the Quicktest
file from the design specification. The PADLOC file is used
to place the test chip pads around the core logic. It contains
placement data for the pad ring. The PADLOC specification
document describes how to create a PADLOC file from the
design specification.

Identify and Create Custom Modules. Not all blocks in a
microprocessor can be implemented as standard cells, although
the list of such blocks is becoming shorter and shorter. This
task identifies all blocks within the core that should be
implemented as custom logic and creates the identified blocks
along with all the models and information required to use
them in the downstream standard cell design methodology.

To identify blocks that should be custom, the HDL and
design specification must be studied along with the design
goals. Blocks are custom-designed to meet area or
performance goals unachievable with standard cells. Typically
these blocks will be limited to memory arrays. However,
they may also include structured data path logic for
implementing highly regular or speed-critical circuits, such as
a large multiport register file, multiplier, or barrel shifter.
This step may involve feasibility studies in which candidate
blocks are partially designed or estimated for both standard
cell and custom implementations. The following items must
be produced for each block selected for custom
implementation:

• Verilog or VHDL model with timing

• Synopsys timing with pin timing

• Cell3 LEF file (Cell3 is an automatic place-and-route tool
from Cadence)

• Artwork database

• Sunrise (an automatic test vector generator from Sunrise)
or ATG (an in-house HP tool similar to Sunrise) fault model.

Translate Vectors. By translating simulation-based verification
vectors into Guide format, the standard path to testers and
later portions of the tool flow is established. Guide is an
in-house HP tester-independent vector translation tool. This
step may require that custom programs or scripts be written
and supported to translate from unknown formats.
Therefore, it is advisable to require the processor vendor to supply
vectors in some known format like Verilog, for which there is
a clear path to Guide. The vectors are needed for functional
and diagnostic purposes only. Manufacturing test will not
run these vectors and will rely on IDDQ, stuck-at, and at-speed
scan testing.

Execute Standard Tool Flow 2 (STF2). As mentioned above,
Standard Tool Flow 2 is a design flow supported by an ICBD
design automation group. It includes Verilog simulation,
Synopsys synthesis, Cell3 place and route, ATG and Sunrise
full-scan testing, artwork and mask generation, and
Quicktest and Guide test program and vector creation. There is
extensive documentation on the entire tool flow. A variation
of STF2 is STF3, which supports partial-scan testing. This
may be an option for cores that would realize significant
area savings from it without the loss of appreciable test
coverage. The VHDL tool flow is derived from an HP proprietary
VHDL design flow. So far, no core has been developed using
this design flow.

Package Core as Megacell. The final step in the process is to
create the data necessary to offer the core as a megacell to
HP customers and ICBD design centers. The requirements
for this type of product are currently being defined. The
release of any core will adhere to the standards set up and
provide all the models, documentation, and support required.

Test Chip Experience 

The Coldfire test chip (Fig. 2) is a test chip implementing
the Coldfire 5202 processor from Motorola. Coldfire is a
new line of embedded microprocessors that improves
performance over the 68000 architecture while maintaining
compatibility with most of the 68000 instruction set and
minimizing area. The Coldfire 5202 has a 2K-byte 4-way
set-associative cache, a debug unit, and JTAG capability (IEEE
1149.1 boundary scan test capability) along with the core as
implemented by Motorola.

Design Transfer 

The Coldfire test chip team received a brief course on the
architecture and a tape of the HDL source for the Coldfire
5202. The tape contained synthesizable HDL for every block
in the design. The cache memory blocks and some tag
maintenance logic were custom-designed at Motorola and had
only a behavioral representation. Each synthesizable block
also had a constraint file used in Synopsys. There was also
a top-level synthesis script. This represented all that was
needed to get started with the port. Subsequently, Motorola
has sent test vectors and fielded questions from ICBD. The
level of support at Motorola has been very good.

Fig. 2. The Coldfire test chip floor plan.

Making the Coldfire Test Chip

The Coldfire test chip has been targeted as a test vehicle for
this methodology and offers early prototypes for customers
interested in using the Coldfire 5202. The Coldfire test chip
is implemented in a 0.5-µm technology and takes advantage
of the process's increased density. A fairly high frequency
of 50 MHz was targeted to show the scalability of the
methodology.

Custom Modules. The team knew at the outset that Motorola
had designed custom cache RAMs and tag logic and that
custom designing caches for every core would not fit into
our methodology. The generated WEST SRAM available
from an HP RAM group provided the solution. These RAMs
are designed for ASIC integration and good performance.
However, to use these RAMs, there were two extra
requirements that were specific to this cache. First, it had to be
byte-writable, that is, each byte of a multibyte word must be
separately writable. The second requirement is that the bits
of one column needed to be reset in one cycle. This is used
to invalidate the cache valid bits at startup or in case of an
invalidate instruction. The RAM group took the first request
and incorporated byte writability (actually bit writability as
implemented in the artwork) into the generator. The
one-cycle invalidate was not changed in the generator. Instead,
the valid bits were turned into flip-flops on this test chip for
schedule reasons. This incurred an area penalty and will
hopefully be fixed in the RAM in the future.

Rewrite HDL. The interface to the cache RAM was
asynchronous in the Motorola implementation. The WEST SRAM is
synchronous. This means that addresses, data, and other
control signals need to be set up before and held until after
the rising edge of the clock. Motorola latches the various
signals on the rising edge of the clock before feeding them
to the RAM. With this scheme, the signals would have
missed the setup time on the WEST SRAM. If no latch is
used, then the hold time cannot be met. As a result,
negative-level-sensitive latches are used to provide the necessary
hold requirement. The control signals that Motorola uses are
sufficient to generate the WEST SRAM control signals. The
4-way set-associativity can be easily implemented as four
data RAMs and four tag RAMs, each representing one way
in the cache organization.
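
The mapping of a 4-way set-associative cache onto four tag RAMs
with separate valid bits can be pictured with a short Python
model. This is only an illustrative sketch, not the Coldfire HDL:
the 16-byte line size and every name in it (split_address, lookup,
fill, invalidate_all) are assumptions made for the example. Only
the 2K-byte capacity and the four ways come from the text.

```python
# Illustrative model only: line size and all names are assumptions.
CACHE_BYTES = 2048      # 2K-byte cache (from the text)
WAYS = 4                # 4-way set associative (from the text)
LINE_BYTES = 16         # assumed line size, not stated in the article
SETS = CACHE_BYTES // (WAYS * LINE_BYTES)   # entries per way

OFFSET_BITS = LINE_BYTES.bit_length() - 1
INDEX_BITS = SETS.bit_length() - 1


def split_address(addr):
    """Split an address into (tag, set index, byte offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


# One tag RAM and one array of valid bits per way, indexed by set.
tag_rams = [[0] * SETS for _ in range(WAYS)]
valid = [[False] * SETS for _ in range(WAYS)]


def lookup(addr):
    """Return the way that hits, or None on a miss."""
    tag, index, _ = split_address(addr)
    for way in range(WAYS):
        if valid[way][index] and tag_rams[way][index] == tag:
            return way
    return None


def fill(addr, way):
    """Record a line fill in the chosen way."""
    tag, index, _ = split_address(addr)
    tag_rams[way][index] = tag
    valid[way][index] = True


def invalidate_all():
    """Clear every valid bit, as at startup or on an invalidate."""
    for way in range(WAYS):
        for index in range(SETS):
            valid[way][index] = False
```

Keeping the valid bits outside the tag RAMs in this model mirrors
the test chip's choice of flip-flops for the valid bits, which is
what makes clearing them all at once straightforward.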

Another change made to save some area was the removal
of the JTAG block. The JTAG methodology does not fit into
the superintegration process since this process has its own
mechanism of doing boundary testing. The JTAG logic is a
fairly modular block that can be stripped out with minimal
perturbation to the rest of the design. The necessary logic
needed in place of the JTAG block is encapsulated in its own
level of hierarchy. The necessary JTAG-like functionality
has been replaced by a scan wrapper that better fits ICBD's
superchip test methodology.

One final change is in the area of testing. Motorola uses a test
mode called ad-hoc mode to test the cache RAM circuitry.
ICBD uses BIST (built-in self-test) instead. As a result, the
logic that makes the cache RAM controllable and observable
and interlocks the pipeline is no longer needed. By rendering
the ad-hoc mode inactive, all logic associated with this test
mode can be minimized.

Synthesis Script Modification. The synthesis script that came
from Motorola put highly detailed constraints on each block.
The reasons are twofold. Motorola was concerned with the
speed of the synthesis job if the entire design were compiled
at once and wanted to be able to do most of the compilation
at the block level. Secondly, Motorola had good ideas about
where Synopsys should be spending its time and wanted to
influence the tool in that direction. A makefile generator was
used to piece together the numerous block-level compiles
and their respective constraint files. This approach does not
necessarily yield the best design in ICBD's technology, as
was found during initial synthesis trials using Motorola's
script. The block-level constraints were often unrealistic
and Synopsys was spending optimization cycles on the
wrong circuits. As a result, the synthesis scripts were
overhauled to put constraints only at the top level, that is, the I/O
specifications of the chip.

A hierarchical compile at the top level replaced the
block-level compile. As a result, Synopsys has more freedom in
partitioning the time. The compile time is long but not
intolerable. The entire synthesis job from reading in HDL to
outputting an optimized netlist at 50 MHz takes 48 hours on an
HP 9000 Model 755 server. For fast turnaround needs, such
as testing a quick fix of a bug, block-level constraints
obtained from hierarchical characterization and the write
script of the previous synthesis run can be used to compile
at the block level. To get even better area, the entire design
has been compiled with the hierarchy flattened so that
interblock optimization can be performed during synthesis.

Verification. All the changes mentioned earlier have been
simulated extensively to make sure that the desired
functionality is achieved without having broken some other part
of the design. Once the functionality is determined, then the
HDL is synthesized to obtain a netlist from which both static
timing and dynamic timing analysis are done. The majority
of the emphasis in timing verification centers on static
timing analysis. Synopsys' timing analyzer is used to generate
timing reports. False paths and multicycle paths have been
carefully reviewed to make sure that there is no escaped
path in the report. Both maximum and minimum paths are
reported to expose possible setup and hold violations.

The vectors run on the design include benchmark,
diagnostic, and RIS vectors from Motorola. Motorola has developed
a sophisticated random instruction sequence (RIS)
generator that can be tuned to generate instructions in an area of
interest along with random interrupts and exceptions to
perturb the processor. In future Coldfire cores, the ability to
generate RIS vectors will be incorporated into the
verification process. This time, Motorola has generated all the RIS
vectors and sent them to ICBD. Verification using a more
formal method of binary decision diagram comparison has
also been pursued using Motorola's in-house tool. This step
will not be available for every core since most processor
vendors do not support this methodology.
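
The idea of a tunable random instruction sequence generator can
be sketched in Python. This is not Motorola's RIS tool, whose
internals the article does not describe; the instruction classes,
the weights, and the function name generate_sequence are all
hypothetical. The sketch shows only the two properties named
above: the instruction mix can be biased toward an area of
interest, and interrupts are injected at random.

```python
import random

# Hypothetical instruction classes; a real generator emits legal opcodes.
CLASSES = ["alu", "load_store", "branch", "cache_op"]


def generate_sequence(length, weights, interrupt_rate, seed):
    """Return a list of events: instruction class names or 'INTERRUPT'.

    `weights` biases the mix toward an area of interest, and
    `interrupt_rate` is the probability of inserting an interrupt
    before each instruction. A fixed seed makes the run repeatable.
    """
    rng = random.Random(seed)
    events = []
    for _ in range(length):
        if rng.random() < interrupt_rate:
            events.append("INTERRUPT")
        events.append(rng.choices(CLASSES, weights=weights)[0])
    return events


# Bias the mix heavily toward cache operations.
seq = generate_sequence(length=200, weights=[1, 1, 1, 7],
                        interrupt_rate=0.05, seed=42)
```

Seeding the generator matters in practice: a failing sequence can
be regenerated exactly from its seed instead of being stored.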

Simulation of the netlist had some hurdles. One is the
inability of the netlist to reset properly. This problem has its roots
in the way the reset logic was done in the HDL. Instead of
using an explicit reset inference on all flip-flops, the reset
logic became part of the input logic. Depending on where
the reset was structured in the logic, it might or might not
cause a particular flip-flop to reset correctly. In fact, this
problem is more general. Every time a reset-like signal
is used, unknown states (Xs) are not guaranteed to be
suppressed. Unknown states are periodically introduced
into the design by captured vectors that use uninitialized
memory for operations. For example, an uninitialized stack
memory may be used to fill a cache line and pushed back to
memory. Granted, this is a mere simulation issue. However,
it makes verification harder because real problems that
would otherwise be masked become visible only after these
issues are fixed.

Motorola will restructure their HDL to avoid this problem in
the future. However, for the Coldfire test chip, several steps
have been taken to remedy the problem. Every flip-flop and
latch in the design is reset using a force-and-release pair
upon startup. When unknown states are introduced into the
system, the pattern is intercepted and given a random value
instead. Since unknown states are essentially conditions in
which the state of one or more bits is unknown, randomizing
these patterns effectively gives an unknown pattern without
the simulation nightmare.
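
The interception of unknown states can be pictured with a small
Python sketch. It is not the actual testbench code, which the
article does not show; representing a bus value as a string of
0, 1, and x characters and the function name randomize_unknowns
are assumptions made for illustration. Each unknown bit is
replaced with a random 0 or 1, and known bits pass through
unchanged.

```python
import random


def randomize_unknowns(bits, rng):
    """Replace every unknown bit ('x' or 'X') with a random 0 or 1.

    Known bits pass through unchanged. `rng` is a random.Random
    instance so that a run can be reproduced from its seed.
    """
    return "".join(rng.choice("01") if b in "xX" else b for b in bits)


rng = random.Random(1997)     # fixed seed for a repeatable run
word = "10xx01x1"             # hypothetical bus value, three unknown bits
clean = randomize_unknowns(word, rng)
```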

Test Strategy. To make the test work with STF2 as mentioned
earlier, the core is full-scan except the register file, the
instruction buffer, and the latches in front of the cache RAM.
The latches can also be tested using the latest methods from
ATG and thus offer virtually no degradation of the test
coverage. The cache RAM has BIST circuitry testing all eight
RAMs in parallel. The BIST mode is encoded in the test
mode pins available on the Coldfire 5202. A limited number
of functional vectors run in verification are also ported to
the testers.

Technology Independence. The entire core is technology
independent. The only technology dependent portion of the
Coldfire test chip is the pads. Since the prototypes are
targeted to be used in Motorola's 5V evaluation boards, the 3V
and 5V pads in the HP CMOS14 library are used. Since only
I/O pads exist, input-only and output-only pads are made by tying
off the appropriate enable signals. The pads are instantiated
only for the test chip. Synthesizable HDL versions of the
pads do exist and can be synthesized to buffers when the
megacell becomes available.

Results 

The Coldfire test chip is the first trial of the proposed
methodology. The performance target of 50 MHz has been met
with no custom cells or modules. Both small die size and
high test coverage were achieved by this chip. Higher row
utilization is only limited by extreme congestion spots like
the barrel shifter, register file, and pipeline control block. An
even smaller version is possible with a few modifications in
key areas. In addition, future versions are not expected to
have the overhead of the invalidate registers implemented
as flip-flops. The gates may also be resized to meet the actual
target frequency.

The desired changes have been successfully implemented.
Motorola's custom cache was turned into synthesizable
control and generated WEST SRAM. The JTAG has been removed
with minimum changes to the original HDL. BIST and test
circuitry have been added. All of these changes have been
verified at the functional and netlist levels. Being able to
make changes at these levels and maintain high confidence
in the design is an invaluable advantage with this approach
that would not have been possible with artwork porting.

Data management that is needed to maintain the coherency
of the design is an important aspect of the project that cannot
be overlooked. Problems in this area occurred fairly early in
the project. Scripts were written to make use of lineup files,
that is, lists of designs with specific revision numbers that
go together for a particular simulation or synthesis run.
Changes that are not yet released are made in private
directories that can be part of a private lineup file. The massive
verification effort requires jobs to be run at every available
time, using every available open Verilog license. Scripts have
also been written to use HP Task Broker to get maximum
efficiency of the available resources.
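
The lineup-file idea can be illustrated with a short Python
sketch. The article does not give the file format or the scripts
themselves, so everything here is an assumption: the one
"name revision" pair per line layout, the block names, and the
functions parse_lineup and apply_private are hypothetical. The
sketch shows a released lineup being overlaid by a private one
for a single run.

```python
def parse_lineup(text):
    """Parse lineup text into {design name: revision}.

    Assumed format: one 'name revision' pair per line; blank lines
    and lines starting with '#' are ignored.
    """
    lineup = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, revision = line.split()
        lineup[name] = revision
    return lineup


def apply_private(released, private):
    """Overlay a private lineup on a released one; private entries win."""
    merged = dict(released)
    merged.update(private)
    return merged


released = parse_lineup("""
# hypothetical released lineup
cache_ctl 1.4
barrel_shifter 2.0
""")
private = parse_lineup("cache_ctl 1.5-private\n")
run_lineup = apply_private(released, private)
```

Because apply_private copies the released lineup before updating
it, an unreleased change affects only the run that asks for it.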

Conclusion 

Porting processor cores using the new ICBD methodology of
standard cell synthesis has been shown to be a viable
alternative to the traditional artwork port. HDL porting has the
advantages of testability, technology independence,
customizability, efficient area use, system simulation capability, and
presilicon verification. It is also a straightforward
methodology to support since virtually all components of it are already
in use in the HP Standard Tool Flow 2.

The approach has its disadvantages. It cannot be applied
indiscriminately on any processor core. Many cores
designed today still do not have synthesizable HDL. The
synthesizability of the core may also run the gamut from being
very easy to extremely difficult, depending on a host of issues
such as clocking strategy, coding style, and architecture
complexity. The need for customization puts even higher
expectations on the quality of the HDL. Trying to change the
functionality of a design written with raw Boolean equations
and flip-flop instantiations is almost as daunting as editing a
netlist. Therefore, the selection of a microprocessor vendor
may depend on the vendor's design methodology. For cores
that do not have synthesizable HDL, artwork porting may
still be the only option.






HDL porting will become increasingly feasible with better
synthesis tools and denser and faster technology. The
advances in these two areas have now reached a threshold at
which implementation of entire microprocessor cores with
standard cells compiled using HDL synthesis is practicable.
As more processors are designed using HDL and synthesis,
this methodology will become more general. As the speed of
the technology increases, the level of processor performance
achievable using this methodology also increases. Silicon
compilation is slowly becoming a reality. IC porting in the
future should reach a level similar to porting software today,
as designs are targeted to different technologies with a few
changes in the synthesis and constraint scripts.

General-Purpose 3V CMOS
Operational Amplifier with a New
Constant-Transconductance Input
Stage

Design trade-offs for a low-voltage two-stage amplifier in the HP CMOS14
process are presented and some of the issues of low-voltage analog 
design are discussed. The design of a new constant-transconductance 
input stage that has a rail-to-rail common-mode input range is described, 
along with the rail-to-rail class-AB output stage. The performance 
specifications and area of this amplifier are compared with a similar 
design in a previous process, CMOS34. 

by Derek L. Knee and Charles E. Moore 



Experience gained over the last few years within the design
centers of the HP Integrated Circuit Business Division (ICBD)
has shown that a general-purpose operational amplifier is a
fundamental building block for many mixed-signal integrated
circuits. These general-purpose operational amplifiers are
typically used in support functions and not in the
high-frequency differential signal paths.

With the recent process release of AMOS14TB, the analog
version of the HP CMOS14TB IC process, the logical step was
to design a general-purpose operational amplifier for use with
mixed analog/digital chips using AMOS14TB. However, from
an analog standpoint, the technology change from CMOS34,
the most recent process in which analog circuits had been
implemented, to CMOS14 was quite severe because of the
power supply reduction from 5V to 3.3V. Because of the lower
supply voltage specification, new circuit design techniques
needed to be developed and the general-purpose operational
amplifier was chosen as one of the test vehicles to achieve
this goal. The amplifier was also integrated onto an AMOS14
test chip.

Design Objectives 

Because of the usefulness of the previous CMOS34
general-purpose operational amplifier, the electrical specifications
for the AMOS14 version were derived from the CMOS34
amplifier. The power supply range was altered because of
the technology change. Other parameters such as input
offset voltage, input referred noise, and size were to be
minimized, while open-loop voltage gain, gain margin, phase
margin, and power supply rejection ratio were to be
maximized. A list of the design objectives is shown in Table I.

Table I
Design Objectives for AMOS14 Operational Amplifier

Parameter                          Target Value

Single-supply operation            2.7V ≤ AV_DD ≤ 3.6V
Temperature range                  0°C ≤ T ≤ 110°C
Outputs                            Single-ended
Low quiescent power consumption    I_DD ≤ 1 mA
Small-signal bandwidth f_0         1 MHz ≤ f_0 ≤ 5 MHz
  (unity gain)
Small-signal bandwidth             Independent of CMIR
Slew rate SR                       1 V/µs ≤ SR ≤ 5 V/µs
Output voltage range               AV_SS + 0.2V ≤ V_OUT ≤ AV_DD − 0.2V
Common-mode input range CMIR       AV_SS ≤ CMIR ≤ AV_DD
Load capacitance range             C_load ≤ 100 pF
Load resistance range              R_load ≥ 300Ω

Configuration

Based on the design objectives shown in Table I and the
experience of the authors in the design of previous
general-purpose operational amplifiers, a two-stage configuration
with a class-AB output stage was chosen. This configuration
is capable of satisfying the power and load requirements. An
added constraint for the AMOS14 version (based on
limitations of the previous versions) is the specification for
constant small-signal bandwidth, independent of the
common-mode input range, CMIR. The amplifier has a differential
input, and the common-mode input voltage is the average
value of the two input voltages. CMIR is the range over
which the common-mode input voltage is expected to vary.
A small-signal bandwidth that is independent of CMIR
implies that the input differential stage has a constant
small-signal input transconductance, g_m, over the full CMIR, even
if the CMIR is as large as the difference between the power
supply rails.

Leveraging the new AMOS14 circuit design from the existing
CMOS34 design was difficult because the power supply
voltage range is reduced while the PMOS and NMOS transistor
thresholds, V_tp and V_tn respectively, are essentially
unchanged. The power supply range for AMOS14 is reduced by
33% from that of CMOS34. This power supply reduction is
fairly significant for analog designs in which devices are
connected in series. The outcome was that new low-voltage
design techniques had to be employed to implement the
equivalent operational amplifier in AMOS14 technology.

Constant-Transconductance Differential Input Stage 

To obtain a differential input stage that operates over a
rail-to-rail input voltage range requires an NMOS and PMOS pair
driven in parallel. Because of complementary biasing
requirements, special circuit design precautions need to be taken to
ensure that the overall g_m, or the sum of the individual
transistor g_m's, remains constant over the CMIR. Without this
added circuitry, the frequency compensation could not be
optimized over the CMIR.

The requirements for the constant-g_m input stage are:

• A simple circuit with a minimum number of components

• Low-voltage operation

• Input devices operating in the square-law region where g_m
is highest

• Constant-g_m control circuitry operating in a closed-loop
mode with the input differential devices to exhibit smooth
transition regions over the CMIR

• Constant-g_m control circuitry that does not use reference
voltage trip levels to control the differential bias currents,
thus avoiding coupling supply noise into the input stage.

An extensive search of the literature 1 " 1 '' could not locate a 
circuit that met this list of requirements. Therefore, a new
constant-g_m input stage was needed.

If I_tn and I_tp are the tail currents of the NMOS and PMOS
differential pairs respectively, then the following
relationship is required for any common-mode input voltage:

$$\sqrt{2K_n I_{tn}} + \sqrt{2K_p I_{tp}} = g_m = \text{Constant}, \tag{1}$$

where

$$K_n = \mu_n C_{ox} \left(\frac{W}{L}\right)_n \tag{2a}$$

and

$$K_p = \mu_p C_{ox} \left(\frac{W}{L}\right)_p. \tag{2b}$$

In equations 2a and 2b, µ is the carrier mobility under the
channel, C_ox is the transistor gate capacitance per unit area,
W is the transistor gate width, and L is the transistor gate
length.

If the PMOS and NMOS transistors are sized so that K_n = K_p,
then equation 1 can be rewritten as:

$$\sqrt{I_{tn}} + \sqrt{I_{tp}} = \text{Constant}. \tag{3}$$



A new feedback control loop circuit was designed that con- 
trols the bias currents in the NMOS and PMOS differential 
pair transistors so that equation 3 holds for all common- 
mode input voltages. This new circuit is shown in Fig. 1. It 
uses what the authors refer to as the 4I/I principle. 

In Fig. 1, transistors N0A, N0B, N1A, and N1B form the
NMOS input section. Devices P0A, P0B, P1A, and P1B form
the PMOS input section. These two sections together form
the input stage to an operational amplifier. The output
currents from these sections (I_OPP, I_OPN, I_ONP, and I_ONN) are
summed in the first gain stage, described below. It is the
overall g_m of these NMOS and PMOS input devices that is
held constant over the CMIR.

Fig. 1. Constant-transconductance amplifier input stage.

The current mirror N2C biases the NMOS input pair and the
current mirror P2C biases the PMOS input pair. The NMOS
CMIR monitor devices, N1A and N1B, are biased by N2A and
N2B at a current of 3I. The PMOS CMIR monitor devices, P1A
and P1B, are also biased by P2A and P2B at a current of 3I.

For midsupply common-mode input range, both the NMOS 
input section and the PMOS input section are biased on. 
The PMOS CMIR input monitor devices, P1A and P1B, 
source a current of 31 to the node CMN. Tliis 31 source cur- 
rent is added algebraically to the II current sink of N2C, 
resulting in the NMOS differential pair, NOA and NOB, being 
biased at a current I. Similarly, the NM( IS CMIR input moni- 
tor devices, N1A and NIB, sink a current Of 31 from the node 
CMP. This 31 current is added algebraically to the 41 source 
current of P2C, resulting in the PMOS differential pair, POA 
and POB, being biased at a current I. Therefore the NMOS 
and PMOS input sections are both biased at I for Che mid- 
supply common-mode input. 

For common-mode inputs near AV_DD, the NMOS input section
is biased correctly, but the PMOS input section is off.
The current source devices P2B and P2C are also off and
the PMOS CMIR monitor devices, P1A and P1B, supply no
current. Since no current is added to the current source
N2C, the NMOS differential pair, N0A and N0B, is now
biased at a current of 4I. A similar argument holds for the
PMOS devices when the common-mode input is close to
AV_SS, and the PMOS transistors are biased at 4I.

The differential input sections will be biased in one of the
following modes:

1. The NMOS devices biased at 4I and the PMOS section
with no bias current for low CMIR:

$$\sqrt{4I} + \sqrt{0I} = \text{Constant}. \qquad (4a)$$

2. The PMOS devices biased at 4I and the NMOS section
with no bias current for high CMIR:

$$\sqrt{0I} + \sqrt{4I} = \text{Constant}. \qquad (4b)$$

3. Both sections biased at I when the CMIR is such that both
N2B and P2B are biased correctly:

$$\sqrt{1I} + \sqrt{1I} = \text{Constant}. \qquad (4c)$$

Fig. 2. The x-axis represents the common-mode input range (CMIR)
of the circuit of Fig. 1 from AV_SS to AV_DD (rail to rail). The upper
curve shows the overall g_m. The lower curves show the individual
g_m's of the input sections as a function of the CMIR.
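
The three bias modes above can be checked numerically. The following is a minimal sketch that evaluates the square-law sum of equation 1 for each mode, assuming K_n = K_p as in equation 3. The values of K and I are hypothetical and chosen only for illustration; subthreshold operation and output conductance, which account for the small residual variation, are not modeled.

```python
import math


def pair_gm(k, tail_current):
    """Square-law g_m of one differential pair, sqrt(2*K*I_t), as in equation 1."""
    return math.sqrt(2.0 * k * tail_current)


def total_gm(k, i_tn, i_tp):
    """Overall g_m: sum of the NMOS and PMOS pair contributions."""
    return pair_gm(k, i_tn) + pair_gm(k, i_tp)


K = 1.0e-3   # hypothetical K_n = K_p, in A/V^2
I = 10.0e-6  # hypothetical unit bias current, in A

# (NMOS tail current, PMOS tail current) for the three bias modes
modes = {
    "NMOS at 4I, PMOS off (4a)": (4 * I, 0.0),
    "PMOS at 4I, NMOS off (4b)": (0.0, 4 * I),
    "both sections at I (4c)": (I, I),
}

gm_by_mode = {name: total_gm(K, i_tn, i_tp) for name, (i_tn, i_tp) in modes.items()}

for name, gm in gm_by_mode.items():
    print(f"{name}: g_m = {gm * 1e6:.1f} uS")
```

All three modes give the same sum because sqrt(4I) = 2 sqrt(I) = sqrt(I) + sqrt(I), which is the basis of the 4I/I principle.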

The closed-loop CMIR monitor circuitry smoothly controls
the transition between these three modes of operation. This
is demonstrated in Fig. 2. The x-axis of Fig. 2 represents
the CMIR from AV_SS to AV_DD (rail to rail). The upper curve
shows the overall g_m, or the sum of the NMOS and PMOS
input stage g_m's, while the lower curves show the individual
g_m's of the input sections as a function of CMIR. The overall
g_m has a total variation of only 5%. This number includes the
second-order effects of subthreshold operation and output
conductance.

Fig. 3 shows the simulated variation of the intrinsic input
offset voltage, V_os, as a function of the CMIR. This curve
shows one of the limitations of a complex input differential 
pair input structure: the input offset voltage varies as each 
of the input differential pairs is activated or deactivated. 
During the transitions between modes, the common-mode 
rejection ratio, CMRR, is reduced. 1 " 1 - Therefore, the design 
of Fig. 1 attempts to minimize the width of these transition 
regions with respect to CMIR. 

First Gain Stage 

The first gain stage sums the four output currents from the
input differential stage: I_OPP, I_OPN, I_ONP, and I_ONN. The
criteria for selecting the best gain stage were:

• The stage should use wide-swing cascode current sources.

• It should interface easily with the following class-AB output
stage or second gain stage.

• It should not add any additional noise or offset to the input
stage.

The gain and current summing stage selected is shown in
Fig. 4.¹⁴ This stage reduces the transistor count considerably
because of its compact integration with the class-AB
output stage (see next section).

Fig. 3. Simulated variation of the intrinsic input offset voltage, V_os,
as a function of the CMIR.

Fig. 4. Schematic diagram of the first gain stage, which sums the
four output currents from the input stage.

Second Gain Stage 

The criteria used in the selection of the class-AB output
stage implementation were:

• Simple and high-speed design

• No complex active or amplifier feedback paths in the AB
control circuitry

• Low-V_DD operation

• Good power supply rejection ratio

• No direct dependence on supply voltage for bias current
setup

• No noise or offset to be added to the first stage of the
amplifier.

The output stage chosen is shown in Fig. 5. The circuit
shown in Fig. 5a is a simplified version of the output stage.
The schematic in Fig. 5b shows the implementation of the
AB output stage integrated together with the first gain stage.
This output stage was first developed for 5V operation¹⁵ and
later modified for an all-digital process.¹⁶

The output stage uses common-source output devices for
low-voltage operation. The theoretical minimum supply
voltage is twice the MOS threshold voltage plus a saturation
voltage. The complementary output devices PDR and NDR
are driven by complementary common-gate level shifters,
PAB and NAB. The first-stage input signals are fed into the
output stage at nodes PDRV and NDRV. During quiescent
operation, PAB and NAB are biased in the conducting state.
The potentials at PDRV and NDRV are established to
minimize the quiescent current through the large output driver
devices, PDR and NDR. This biasing arrangement is
established through two translinear loops. The loop that biases
PDR consists of P5A, P5B, PAB, and PDR. Similarly, NDR is
biased by the loop consisting of N5A, N5B, NAB, and NDR.
For a short tutorial on translinear theory see reference 17.

During a negative slew at the output, the gate voltage of
NDR is pulled high. Since the bias voltage ABN is fixed, the
device NAB will shut off. The device PAB will then be
conducting the full bias current, which will result in an
increase in the gate-to-source voltage of PAB and consequently
a reduction in the gate-to-source voltage of PDR. A similar
operation occurs during positive sourcing when the bias
voltage for NDR is reduced. 1,1 

The integration of the class-AB stage and the first gain stage
has two major advantages. The first advantage is the floating
current source, which is set up through two additional
translinear loops: N5A, N5B, NFC, N3A and P5A, P5B, PFC,
P3A. Because of the floating nature of the bias devices NFC
and PFC and NAB and PAB, this current source contributes
much less to the noise and offset of the amplifier. Secondly,
the variation of the output quiescent current is reduced
because the floating current source of PFC and NFC tracks
the AB current source of NAB and PAB.

Final Circuit and Results 

The complete schematic for the AMOS14 operational amplifier
is shown in Fig. 6. This figure shows in detail the cascode
current source implementation.

Fig. 7 shows the open-loop small-signal frequency response
and phase characteristics of the amplifier driving four
different load combinations. These are 10 MΩ||1 pF,
10 MΩ||100 pF, 300 Ω||1 pF, and 100 Ω||100 pF. Fig. 8 shows
the small-signal frequency response and phase characteristics
of the amplifier for different CMIR values ranging from AV_SS
to AV_DD. Note that the unity-gain frequency f_0 is essentially
independent of CMIR.

The small-signal step response is shown in Fig. 9 for the
same load combinations as Fig. 7. The large-signal step
response, indicative of the amplifier's slew rate, is shown
in Fig. 10. The artwork layout for the operational amplifier
is shown in Fig. 11.

Table II illustrates the overall similarities of the AMOS14
operational amplifier to the CMOS34 version. In summary,
the AMOS14 design achieved a 2× improvement in bandwidth,
a 2.5× increase in class-AB output drive, and a 3×
improvement in slew rate in a third of the area while at the
same time including the additional constant-g_m circuitry.

Fig. 7. (top) Open-loop small-signal frequency response for
different load conditions. (bottom) Phase response.

Fig. 8. (top) Small-signal frequency response for different CMIR
values. (bottom) Phase response.
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Table II 

Amplifier Process Comparison

Parameter                           AMOS14             CMOS34

Supply voltage                      2.7 to 3.6V        4.5 to 5.5V
Supply current                      325 µA             750 µA
Common-mode input range (CMIR)      AV_SS to AV_DD     AV_SS to AV_DD
Constant-g_m input stage            Yes                No
Input stage g_m variation,          < ±5%              50%
  AV_SS ≤ CMIR ≤ AV_DD
Intrinsic input offset voltage,     -80 µV             -120 µV
  CMIR = AV_DD/2
Resistive load                      300Ω min           300Ω min
Capacitive load                     100 pF max         100 pF max
Maximum output drive current        ±5 mA              ±2 mA
Maximum output swing at I_max       AV_DD - 0.25       AV_DD - 0.3
Minimum output swing at I_max       AV_SS + 0.25       AV_SS + 0.3
Open-loop gain (no load)            > 100 dB           > 100 dB
Slew rate                           6 V/µs             2.0 V/µs*
Unity-gain bandwidth, f_0           1 MHz              0.6 MHz*
Phase margin**                      55 degrees         40 degrees
Gain margin**                       -10 dB             -8 dB
PSRR+, AV_SS ≤ CMIR ≤ AV_DD         > 70 dB            > 80 dB
PSRR-, AV_SS ≤ CMIR ≤ AV_DD         > 70 dB            > 80 dB
Cell size                           251 µm × 111 µm    400 µm × 210 µm

* Depends on CMIR

" ' Absolute worst case conditions toi l|, lllv AVrjrj. R. C. models See Fig I 
CMIR = common-mode input range
PSRR = power supply rejection ratio 
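
The improvement ratios quoted in the summary can be cross-checked against Table II. The sketch below is only an illustration: the table entries are typed in by hand, and the 111 µm cell dimension and the 2 mA drive current are hard to read in the source, so they should be treated as assumptions. The bandwidth ratio is approximate because the CMOS34 bandwidth depends on CMIR.

```python
# Table II values entered by hand; the 111 um dimension and the 2 mA drive
# current are assumed readings of entries that are hard to read in the source.
amos14 = {"bandwidth_mhz": 1.0, "drive_ma": 5.0, "cell_um": (251.0, 111.0)}
cmos34 = {"bandwidth_mhz": 0.6, "drive_ma": 2.0, "cell_um": (400.0, 210.0)}


def cell_area(cell_um):
    """Cell area in square micrometers from (width, height)."""
    width, height = cell_um
    return width * height


bandwidth_ratio = amos14["bandwidth_mhz"] / cmos34["bandwidth_mhz"]
drive_ratio = amos14["drive_ma"] / cmos34["drive_ma"]
area_ratio = cell_area(amos14["cell_um"]) / cell_area(cmos34["cell_um"])

print(f"bandwidth ratio: {bandwidth_ratio:.2f}")
print(f"output drive ratio: {drive_ratio:.2f}")
print(f"area ratio: {area_ratio:.2f}")
```

With these readings the drive ratio is 2.5 and the area ratio is about one third, in line with the summary.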
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Improving Heat Transfer from a 
Flip-Chip Package 



The lid of an ASIC package can significantly increase the temperature of
the die by impeding heat transfer. In flip-chip packages the backside of a
die can be exposed by eliminating the lid, thus allowing a heat sink to be
attached directly. Numerical finite difference methods and experimentation
were used to investigate the differences between lidded and lidless
flip-chip designs. The results demonstrate that a lidless package is a
superior design because of the increased thermal conductivity between
the die and the heat sink.

by Cullen E. Bash and Richard L. Blanco



The cooling of electronic components has traditionally been
considered as two separate problems: optimizing the internal
thermal path within the package, and cooling the packaged
component by optimizing the external thermal path. While
this method has the advantage of being partitionable and
therefore solvable independently by separate organizations
or companies, it fails to engineer the thermally optimum
solution. This is especially critical for high-power dice,
which typically require custom heat sinks.

The electronics industry is moving in the direction of lidless
flip-chip packages, which create new possibilities for cooling
the dice. Processor chips from other major electronics
suppliers are currently available in lidless packages because of
their thermal and cost advantages.¹

As an experiment to improve the design of a high-power
processor package, the HP PA 8000 processor, a proposed
design of a lidless package was compared to the traditional
lidded package currently in use. An example of a lidless
package using an air cooled heat sink has been discussed in
an earlier paper.² In the present investigation, the proposed
design uses the evaporator of a heat pipe assembly to
contact the die, thus replacing the lid. This concept has the
additional benefit of reducing the cost of the package by
eliminating the relatively expensive lid.

The investigation began by constructing finite difference
models of the lidded and lidless packages. The purpose of
the models was not to correlate with measured results but
to aid in understanding the magnitude of the relative
improvements of the lidless design. After reviewing the results,
laboratory measurements were made of the two designs and
the relative improvements in thermal performance were
recorded.

Two different methods were chosen to cool the packages.
The heat pipe employed in the current HP PA 8000 design
was a natural choice because of its practicality. Additionally,
because of concerns about thermal gradients in the aluminum
heat pipe evaporator and the difficulty of matching
these to the boundary conditions in the finite difference
model, a very efficient but impractical liquid cooled heat
sink was chosen. The liquid cooled heat sink is highly
efficient and behaves like an isothermal block, which is easily
modeled.

For consistency throughout this paper, the term aluminum
evaporator heat sink refers to the aluminum evaporator
on the heat pipe assembly that directly sinks heat from the
package. Likewise, the term copper block heat sink refers to
the copper block on the liquid cooled heat sink that acts in
the same capacity.

Package Construction 

The lidded and lidless package designs are shown in Fig. 1 for
the aluminum evaporator heat sink. Both packages are
constructed identically between the printed circuit board and
the die. Mounted on an FR-4 printed circuit board is a plastic
socket containing 1085 contacts made from 0.025-mm gold
plated molybdenum wire (see Fig. 2). A ceramic land grid
array package rests on the socket, making electrical contact
between the die and the board. The processor die is attached
using flip-chip technology,³ resulting in about 2500 solder
bump connections encapsulated by an underfill material
between the ceramic substrate and the silicon die. Fig. 3
shows the lidless package, plastic socket, and printed circuit
board assembly. The aluminum carrier shown in the picture
is used to support the assembly. The heat sink has been left
off so that the assembly can be seen more clearly.

The lidded design uses silver-filled epoxy between the die
and the lid to enhance thermal performance. The lid is
fabricated from a Kovar ring brazed to a sheet of tungsten copper.
Fig. 4 shows the lidless and lidded packages side by side for
comparison. A more detailed description of the lidded
package can be found elsewhere in the literature.⁴

The lidless design uses Dow Corning 340 thermal grease as
the thermal interface above the die. This is a conservative
choice considering that there are thermal greases available
that have thermal conductivities more than three times that
of Dow Corning 340.
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Measurement Technique 

To compare the thermal performance of the two packages,
a thermal test die with a temperature-sensitive resistor was
placed into each package to allow direct measurement of
the die temperature. The packages were each tested on the
same socketed printed circuit board connected to an HP
75000 data acquisition system and a power supply. An HP
9000 Series 300 workstation with data acquisition software,
HP VEE, displayed the die temperature as a function of time
while the power supply provided the power to the die.

The two thermal test dice were calibrated in a Delta Design
9000 Series convective oven. Resistances were captured with
the data acquisition system at four different temperatures
ranging from 18 to 90 degrees Celsius. A least-squares fit
was obtained for each package and the results were placed
into HP VEE.
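
The calibration step can be illustrated with an ordinary least-squares line fit. The sketch below is not the authors' HP VEE program: the four calibration points are synthetic, the resistor is assumed to be linear over the range, and the names `fit_line` and `die_temperature` are hypothetical. It fits temperature as a function of resistance so that a measured resistance can be converted directly to a die temperature.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic calibration data: four oven temperatures between 18 and 90 C and
# a hypothetical linear resistor (1000 ohms at 0 C, 3.9 ohms per degree).
oven_temps_c = [18.0, 42.0, 66.0, 90.0]
resistances_ohm = [1000.0 + 3.9 * t for t in oven_temps_c]

# Fit temperature against resistance, the direction needed during measurement.
slope, intercept = fit_line(resistances_ohm, oven_temps_c)


def die_temperature(resistance_ohm):
    """Convert a measured resistance to die temperature using the fit."""
    return slope * resistance_ohm + intercept


print(f"{die_temperature(1195.0):.1f} C")
```

A separate fit would be kept for each test die, since each package was calibrated individually.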

Four experiments were undertaken comparing each pack- 
age — lidded and lidless — cooled by each of the heat 
sinks — the aluminum evaporator and the copper block. 

Copper Block. The copper block heat sink was used to
provide an isothermal surface to the package to which it was
attached. This was accomplished via an efficient liquid
cooled heat sink mounted to the backside of the highly
conductive copper block as depicted in Fig. 5. The liquid
cooled heat sink consists of a partially hollowed aluminum
block through which water is cycled. The water is cooled by
ambient air via a heat exchanger. Measurements showed
that the surface of the copper block was kept isothermal to
within 3°C, which indicated that the liquid cooled heat sink
was functioning as intended.

Each package was tested with the copper block heat sink by
compressing it between the upper and lower stiffeners with
a C-clamp. The setup was similar to Fig. 6 but with the heat
pipe replaced by the liquid cooled heat sink. A load cell was
employed to measure the compressive force being generated
by the clamping assembly and stiffener plates were used to
distribute the C-clamp load. Each assembly was compressed
to 180 pounds to ensure comparable contact resistance
between the two packages. Three thermocouples were placed
within the copper block to record the heat sink temperature.

Fig. 2. Plastic socket with 1085 gold plated molybdenum wire
contacts.

Aluminum Evaporator. The aluminum evaporator is cooled by
a heat pipe assembly. The assembly is constructed of three
sintered copper pipes with water as the working fluid
mounted planar to the evaporator, and thin aluminum fins
are attached to the opposite end of the pipes. Heat from the
aluminum evaporator enters the pipes, causing the water
to vaporize. The steam is condensed at the other end of the
pipes by air flowing over the fins. The water then returns to
the evaporator via capillary action, thus completing the
thermodynamic cycle. Upon measurement, it was discovered
that the aluminum evaporator was indeed isothermal like
the copper block, although at a higher temperature.

The aluminum evaporator was used to test the thermal
performance of the packages in a manner similar to the copper
block. A clamping assembly comparable to that used for
the copper block was employed (the clamping assembly is
shown in Fig. 6 with the heat pipe). The entire assembly was
placed in a wind tunnel with a nominal velocity of 1.8 meters
per second. A single thermocouple was placed near the
evaporator plate/package interface to record temperature.

Fig. 3. The experimental assembly without the heat sink, showing
the printed circuit board, socket, and lidless package.

Fig. 4. Lidless and lidded packages used in the experiment.
The lidded package is on the right.

Data Comparison Methodology 

Thermal resistance will be used throughout this paper as a
means of comparing the data obtained from modeling and
measurement. It is defined by equation 1 and frequently
calculated using empirical data with equation 2:

$$R = L/(kA) \qquad (1)$$

$$R = \Delta T/Q \qquad (2)$$

where L is the thickness of the material, k is the material
thermal conductivity, A is the cross-sectional area, ΔT is
the measured temperature difference, and Q is the heat
flow. By definition, thermal resistance is applicable for
one-dimensional, steady-state heat transfer with no internal
energy generation. In electronics packaging one rarely
encounters one-dimensional heat transfer and there is
significant internal energy generation in the silicon die.
Additionally, it is rarely known explicitly how much heat is
flowing into the heat sink relative to that being absorbed by the
board. Typically, if no additional information is known it is
assumed that all of the heat is dissipated into the heat sink.
Nevertheless, with the restrictions on equations 1 and 2 and
the unknowns involved, thermal resistance remains a useful
quantity for the comparison of similar packages on similar
printed circuit boards and will be used in that capacity in
the interpretation of results.
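
Equations 1 and 2 translate directly into code. The sketch below is an illustration only: the function names are hypothetical, and the measurement triples are the power, heat sink temperature, and die temperature reported later in Tables III and IV, typed in by hand. It applies equation 2 with the assumption stated above that all of the heat flows into the heat sink.

```python
def conduction_resistance(thickness_m, conductivity_w_per_mk, area_m2):
    """Equation 1: R = L / (k * A), in degrees C per watt."""
    return thickness_m / (conductivity_w_per_mk * area_m2)


def measured_resistance(hot_temp_c, cold_temp_c, heat_flow_w):
    """Equation 2: R = delta-T / Q, in degrees C per watt."""
    return (hot_temp_c - cold_temp_c) / heat_flow_w


# (power in W, heat sink temperature in C, die temperature in C),
# as reported in Tables III and IV.
measurements = {
    ("copper block", "lidded"): (93.3, 40.2, 55.1),
    ("copper block", "lidless"): (93.3, 40.0, 47.6),
    ("aluminum evaporator", "lidded"): (85.8, 63.9, 77.1),
    ("aluminum evaporator", "lidless"): (85.3, 66.2, 72.2),
}

# Die-to-heat-sink thermal resistance for each combination.
die_to_sink = {
    case: measured_resistance(die_c, sink_c, power_w)
    for case, (power_w, sink_c, die_c) in measurements.items()
}

for (heat_sink, package), r in die_to_sink.items():
    print(f"{heat_sink}, {package}: {r:.2f} C/W")
```

Because equation 2 uses only temperatures and power, it can be applied to measured data even when the layer thicknesses and conductivities needed by equation 1 are not known.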



Modeling Technique 

A software tool employing a finite difference method was
used to create models to represent the cooling of the
packages under test.⁵ One model was created for the lidded
design and a second was created for the lidless design. With
each model, either the copper block or the aluminum
evaporator could be activated as the heat sink.

Two simplifications were made in modeling the packages.
First, components of the model that were thin layers, such as
the epoxy and grease layers, were modeled as internal plates
with only one-dimensional conduction, normal to the surface
of the layer. Secondly, to simplify the model and reduce
large grid aspect ratios and thus convergence time, geometry
that was nearly coincident and thermally insignificant was
spatially aligned. For example, the plastic socket housing
is 0.7 mm larger than the ceramic but was modeled as the
same overall size.

The FR-4/copper multilayer printed circuit board was modeled
as a solid FR-4 block with a single layer of copper of
thickness equivalent to the combined thicknesses of the
copper layers in the board. The conductivity of the multilayer
printed circuit board was calculated to be equivalent
to the copper and FR-4 material in parallel, while the
conductivity of the single copper layer placed within the modeled
printed circuit board was made equivalent to the copper and
FR-4 material in series. Only solid copper planes were
included in the model since discontinuous signal planes have
been determined to be inconsequential in conducting heat.⁶
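
For a stack of flat layers, "in parallel" and "in series" correspond to two different averages of the layer conductivities. The sketch below shows the standard textbook forms; how the authors applied them to their board is not spelled out in the text, and the layer thicknesses and conductivities used here are hypothetical.

```python
def parallel_conductivity(layers):
    """Heat flowing along the layers: thickness-weighted arithmetic mean.

    layers is a list of (thickness, conductivity) pairs.
    """
    total_thickness = sum(t for t, _ in layers)
    return sum(t * k for t, k in layers) / total_thickness


def series_conductivity(layers):
    """Heat flowing through the layers: thickness-weighted harmonic mean."""
    total_thickness = sum(t for t, _ in layers)
    return total_thickness / sum(t / k for t, k in layers)


# Hypothetical board: 1.4 mm of FR-4 (0.3 W/m-K) and 0.14 mm of solid
# copper planes (390 W/m-K) lumped into a single layer.
board = [(1.4e-3, 0.3), (0.14e-3, 390.0)]

k_parallel = parallel_conductivity(board)
k_series = series_conductivity(board)

print(f"parallel (in-plane): {k_parallel:.1f} W/m-K")
print(f"series (through-plane): {k_series:.2f} W/m-K")
```

The parallel value is dominated by the copper and the series value by the FR-4, which is why a board spreads heat laterally far better than it conducts heat through its thickness.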

To simplify the 1085 individual metallic contacts of the socket
in the plastic housing, a block of equivalent conductivity to
the 1085 individual 0.025-mm-diameter molybdenum wires
was combined in parallel with the conductivity of the plastic
housing.

Similarly, the solder bump layer with underfill was modeled
as the area of 2500 solder bumps in parallel with the area of
the underfill compound, with the conductivity of the internal
plate appropriately weighted by the product of the thermal
conductivity and the area of each material.
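
The area weighting described for the solder bump layer can be sketched as follows. Only the bump count of 2500 comes from the text; the bump diameter, layer footprint, and both conductivities are hypothetical values chosen for illustration, and `area_weighted_conductivity` is a hypothetical helper name.

```python
import math


def area_weighted_conductivity(parts):
    """Effective conductivity of parallel paths through one layer.

    parts is a list of (area, conductivity) pairs; each material is
    weighted by the product of its conductivity and its area.
    """
    total_area = sum(a for a, _ in parts)
    return sum(a * k for a, k in parts) / total_area


BUMP_COUNT = 2500               # from the text
bump_diameter_m = 100e-6        # hypothetical
layer_area_m2 = 15e-3 * 15e-3   # hypothetical die footprint
k_solder = 50.0                 # hypothetical, W/m-K
k_underfill = 0.5               # hypothetical, W/m-K

bump_area_m2 = BUMP_COUNT * math.pi * (bump_diameter_m / 2.0) ** 2
underfill_area_m2 = layer_area_m2 - bump_area_m2

k_layer = area_weighted_conductivity(
    [(bump_area_m2, k_solder), (underfill_area_m2, k_underfill)]
)
print(f"effective layer conductivity: {k_layer:.2f} W/m-K")
```

The same helper applies to the socket, with the molybdenum wires and the plastic housing as the two parallel paths.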

The copper block was modeled as an isothermal volume
with a negative internal power source (i.e., a sink). The
evaporator assembly, while more difficult to approximate,
was modeled as an aluminum block with negative internal
power sources that were of the same volume and in the
same locations as the heat pipes used in the experiments.
The actual cross sections of the heat pipes were modeled
as squares because of the orthogonal limitations of the
software tool.

Fig. 6. Experimental setup with heat pipe.

The models were constructed to calculate conduction
through the package to study the effects of various
constructions and materials. To simplify and reduce convergence
time, cooling from natural convection was not considered.
This method allows good comparative results for small
changes in materials but does not yield results that could
be directly compared with measurements. Nevertheless, the
purpose of the modeling was not to correlate numerical data
with experimental data, but rather to determine whether
experimentation would be worthwhile.

After the models were created, grid sensitivity calculations 
were done to ensure that the results were not affected by 
numerical computation errors induced by grid size or aspect 
ratios. 

Modeling Results 

Copper Block. The modeling results for the copper block are
presented in Table I. These results show that the thermal
resistance between the die and the heat sink of the two
package styles was identical, within modeling error and for
the assumptions made in the model.

Table I

Modeled Thermal Resistance for Copper Block

Package Type    Thermal Resistance (°C/W)

Lidded          0.21
Lidless         0.21



Aluminum Evaporator. The results for the aluminum evaporator
are shown in Table II. Again, the thermal resistance between
the die and the heat sink was nearly identical between the
two designs. The model shows a small benefit in the lidded
design.

Modeling Summary. Given the considerable assumptions and
simplifications, it was difficult to draw a strong conclusion
based solely on the modeling results. Considering the small
differences between the two designs, it was very compelling
to construct the packages and measure them.

Table II

Modeled Thermal Resistance for Aluminum Evaporator

Package Type    Thermal Resistance (°C/W)

Lidded          0.24
Lidless         0.20



Measurement Results 

Temperature measurements were taken for each of the four
package and heat sink combinations. The results are
presented in Tables III and IV. Included in each table are the
power dissipation, heat sink temperature, die temperature,
and thermal resistance. The thermal resistance column refers
to the thermal resistance between the die and the heat sink.
It includes the separate resistances of the die, epoxy and lid
(if applicable), thermal grease, and a portion of the heat sink
through which the thermocouples were embedded.

Copper Block. Table III displays thermal data from each
package using the copper block heat sink and Dow Corning
340 thermal grease at the heat sink/package interface.
Note that the thermal resistance decreased by 50% with
the removal of the lid.



Table III

Thermal Performance of Packages with Copper Block

Package    Power              Heat Sink           Die                 Thermal
Type       Dissipation (W)    Temperature (°C)    Temperature (°C)    Resistance (°C/W)

Lidded     93.3               40.2                55.1                0.16
Lidless    93.3               40.0                47.6                0.08



Aluminum Evaporator. Data from the two packages with the
aluminum evaporator acting as the heat sink and Dow Corning
340 thermal grease at the interface is presented in Table
IV. Note that both the packaged die temperatures and the
heat sink temperatures increased using the aluminum
evaporator because it is not as efficient as the copper block. The
thermal resistance decreased slightly for each package type
over that obtained in Table III. This is most likely because of
differences in thermal grease application or thermocouple
placement. Finally, the thermal resistance decreased 53%
upon removal of the lid. As expected, the decrease in
thermal resistance is independent of the type of heat sink used.



Table IV

Thermal Performance of Packages with Aluminum Evaporator

Package    Power              Heat Sink           Die                 Thermal
Type       Dissipation (W)    Temperature (°C)    Temperature (°C)    Resistance (°C/W)

Lidded     85.8               63.9                77.1                0.15
Lidless    85.3               66.2                72.2                0.07



Measurement Summary. The measured thermal resistance of
the lidded package compares very favorably with measure- 
ments taken by other investigators.' 

Table V displays the amount of power that can be dissipated
by each heat sink/package combination at equivalent die and
air temperatures. The heat sink thermal resistance refers to
the thermal resistance between the heat sink thermocouples
and the ambient air. The results indicate that the lidless
package is a significantly better performer than its lidded
counterpart. The lidless package attached to the copper
block is able to dissipate 34% more power, or 62 watts more
than the lidded version. Likewise, for the aluminum
evaporator, the lidless package is able to dissipate 15% more power,
or 15 watts more. Note that a larger relative improvement is
realized by using a more efficient heat sink. These
calculations assume no losses other than through the heat sinks but
clearly show the superiority of lidless package designs over
lidded.

Table V

Allowable Power Dissipation for Equivalent Die Temperatures
of 110°C and Air Temperature of 50°C

                        Heat Sink     Interface     Calculated
                        Thermal       Thermal       Power
                        Resistance    Resistance    Dissipation
Package Type            (°C/W)        (°C/W)        (W)

Copper Block:
  Lidded                0.17          0.16          180
  Lidless               0.17          0.08          242

Aluminum Evaporator:
  Lidded                0.46          0.15          98
  Lidless               0.46          0.07          113
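
The allowable power figures follow from treating the heat sink and interface resistances as two resistances in series between the die and the air. The sketch below is a hedged reconstruction rather than the authors' calculation: the resistance values are as read from Table V, where some digits are hard to read in the source and are treated as assumptions, and the function name is hypothetical. Small differences from the tabulated powers are expected because the published resistances are rounded to two decimals.

```python
T_DIE_C = 110.0  # die temperature limit used in Table V
T_AIR_C = 50.0   # air temperature used in Table V


def allowable_power(r_heat_sink, r_interface, t_die_c=T_DIE_C, t_air_c=T_AIR_C):
    """Power that holds the die at t_die_c: delta-T over the series resistance."""
    return (t_die_c - t_air_c) / (r_heat_sink + r_interface)


# (heat sink resistance, interface resistance) in C/W, as read from Table V
cases = {
    ("copper block", "lidded"): (0.17, 0.16),
    ("copper block", "lidless"): (0.17, 0.08),
    ("aluminum evaporator", "lidded"): (0.46, 0.15),
    ("aluminum evaporator", "lidless"): (0.46, 0.07),
}

power_w = {case: allowable_power(r_hs, r_int) for case, (r_hs, r_int) in cases.items()}

for (heat_sink, package), p in power_w.items():
    print(f"{heat_sink}, {package}: {p:.0f} W")
```

The same function shows why the relative gain is larger with the more efficient heat sink: the interface resistance is a larger fraction of the total when the heat sink resistance is small.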

The superiority of the lidless package over the lidded, while
expected, may not be as obvious to predict as it first appears.
One of the main arguments for keeping the lid on the package
is that it decreases the heat flux by increasing the surface
area through which heat can leave the package to the heat
sink. By a one-dimensional analysis, it can be shown that the
thermal resistance of the lid is an order of magnitude less
than that of the epoxy. This indicates that using the lid as a
heat spreader to decrease the heat flux through the package
is not necessarily a bad idea. Rather, it is the bonding of the
lid to the die with a layer of epoxy that makes it a relatively
poor thermal solution. If a lid must be used for reasons other
than thermal performance, it is clear that an effort should be
made to reduce as much as possible the thermal resistance
of the bonding material by decreasing its thickness and/or
increasing its thermal conductivity.



Summary and Conclusions 

The results from the modeling showed that the thermal
performances of the packages were very similar and the
lidless design warranted further investigation through lab
measurements.

Comparison of the thermal resistances of the two package
styles was very consistent for both the copper block and the
aluminum evaporator measurement methods. Both
measurement methods showed about a 50% improvement in thermal
resistance in the lidless design.

While impractical for low-cost computer systems, the liquid
cooled copper block measurements determine some limits
of cooling of the HP PA 8000 die. The lidded design could
dissipate 180 watts of power while the lidless solution could
dissipate 242 watts while maintaining the temperature of the
die within the limits for reliable operation.

The measured results indicate that the lidless package is
thermally superior to the lidded design. For the aluminum
evaporator, 15 more watts could be dissipated while
maintaining the same die temperature. This is of particular
significance because a heat pipe assembly is one of the present
cooling designs for the HP PA 8000 processor.

To obtain the thermal performance required in next-generation
chips, the cooling design will need to be solved as a
coupled problem, considering the complete thermal path
originating from the surface of the die and ending in the
cooling air. The lidless package is one possible solution.
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Japanese flower arranging called ikebana, for which 
she has an instructor's license. 
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Timothy P. Loomis 

Software engineer Tim Loomis is currently a technical lead
at HP's Chemical Analysis Solutions Division. In the last
nine years he has worked on the software design and
development of several laboratory information and database
products, including HP ChemLMS, LAB/UX, ChemlCAL, and
ChemStudy. He is professionally interested in information
systems design. He received a PhD in geology and
geophysics in 1971 from Princeton University. He served on
the faculty at Yale, UCLA, and the University of Arizona.
His principal research was in the area of computer
simulations of nonequilibrium thermodynamic models of
multicomponent diffusion and crystal growth in minerals. He has
authored over thirty papers in the fields of geophysics,
geology, applied artificial intelligence methods, and software
prototyping. He also wrote about the origins of pottery of
prehistoric Indians in Southern Arizona. After leaving the
university system and before joining HP, he did consulting
for two years in the database modeling field. Tim was born
in California. He has two teenage daughters. In his free
time, he enjoys vigorous outdoor activities including sea
kayaking, skiing, bike touring, backpacking, and trekking in
remote areas. At the time of this publication, he plans to be
sea kayaking off Ellesmere Island in the Arctic.




88 Policing in ATM Networks 



Mohammad Makarechian 

Mohammad Makarechian earned a BS degree in computer
engineering in 1992 from the University of Alberta, Canada
and an MS degree in computer engineering in 1994 from Boston
University. After graduating he joined HP's Communications
Measurements Division, where he has worked on software
development, performance analysis, and quality assurance. He
is currently developing application software for modular
components of the HP Broadband Series Test System and recently
worked on the HP E4223 ATM policing and traffic
characterization software. Before that he led the software
development for the HP E4219 ATM network impairment emulator
and contributed to the HP E4209 cell protocol processor.
Mohammad was born in Tehran, Iran. He is a member of the
Association for Computing Machinery (ACM) and is involved with
a special ACM interest group on data communications.

Nicholas J. Malcolm 

A software developer at HP's Communications Measurements
Division since 1994, Nicholas Malcolm was the technical lead
for the development of the HP E4223A ATM policing and traffic
characterization test application. He is currently developing
software for an ATM operations and maintenance (OAM) tester
module in the HP Broadband Series Test System. He is
professionally interested in real-time communication,
distributed systems, and software design and is a member of
the IEEE. Born in Murray Bridge, Australia, Nicholas received
a B.Sc. degree with honors in computer science from the
University of Adelaide in 1989 and an M.Sc. degree in computer
science from the University of Calgary in 1991. He went on to
earn a PhD in computer science from Texas A&M University in
1994.

94 MOSFET Scaling 



Paul Vande Voorde 

Paul Vande Voorde received a PhD degree in solid-state physics
from the University of Colorado in 1980. In 1981 he joined HP
Laboratories, working initially in the area of inkjet
printhead fabrication, and then in the area of silicon device
fabrication and modeling. He has contributed to the
development of advanced CMOS, bipolar, and BiCMOS processes.
He was a member of the HP-25 bipolar process development team.
His present areas of research are process and device
simulation as applied to highly scaled CMOS devices. He has
authored or coauthored fifteen technical papers on silicon
processing, process modeling, and device modeling. He is named
as an inventor in five patents concerning silicon processing
and coauthored a textbook called Computer-Aided Design and
VLSI Device Development, which was published by Kluwer
Academic Publishers in Boston, Massachusetts in 1988. Born in
Chamberlain, South Dakota, Paul served in the U.S. Army from
1972 to 1994. He is married and has one child. In his free
time he enjoys playing with his son and going hiking.
99 Clock FM for EMI Reduction



Cornelis D. Hoekstra

Casey Hoekstra is an engineer at HP's Integrated Circuit
Business Division and is currently a project leader for ASIC
development. He is named as a coinventor in a patent on
testing integrated circuit pad input and output structures. He
received BA and MS degrees in physics in 1976 and 1997,
respectively, both from the University of Oregon. He joined
the thermal printhead group at HP's Corvallis Site Operations
in 1977 and a year later moved to CMOS process engineering,
where he worked in photolithography, yield and reliability,
test, and design engineering. He coauthored a 1987 HP Journal
article about the development and application of test hardware
and software for the multichip hybrid printed circuit board
used in advanced handheld calculators, such as the HP Business
Consultant calculator. Born in Schyndel, The Netherlands,
Casey is married and has two daughters. He served in the U.S.
Army as a medic from 1970 to 1973. His outside interests
include working around the house, gardening, hiking, camping,
and climbing.

105 HDL Microprocessor Porting 

Jim J. Lin 

Jim Lin received a BS degree in electrical and computer
engineering in 1994 from Carnegie Mellon University and an
MSEE degree in 1996 from Stanford University. He worked as a
summer intern at HP's Integrated Circuit Business Division in
1993 and began working full-time as a design engineer in 1994,
implementing embedded microprocessors and integrating them on
a single chip with other ASIC functionality. For the project
described in this issue he worked on modifying the HDL for the
cache controller and on synthesis, simulation, and
verification. He is currently responsible for designing
several control paths for the next-generation processor and
for its overall simulation and synthesis strategy. He is
professionally interested in processor microarchitecture
modeling and synthesis. Born in Shanghai, China, he is married
and enjoys outdoor activities and sports such as camping,
hiking, soccer, and basketball. He is a big sports fan and
would love to try his hand at sportscasting.

112 3V CMOS Operational Amplifier 



Derek L. Knee

Derek Knee is a technical contributor and design engineer at
HP's Integrated Circuit Business Division. He recently
designed the general-purpose 3V CMOS operational amplifier
described in this issue and is currently designing the analog
front end for an optical position encoder for a handheld
scanner. He is named as an inventor in three patents
concerning programmable integrated circuits, RF emissions, and
fully differential flash ADCs. He is a member of the IEEE and
is professionally interested in analog design methodologies.
He received BSEE and MSEE degrees in 1979 and 1981,
respectively, from the University of Natal in South Africa.
Before joining HP, he worked at Exar Corporation designing
analog bipolar and CMOS ASICs and at Samsung Semiconductor
designing high-speed PRML disk drive channels using BiCMOS
technology. Since coming to HP in 1987 he has also designed
CMOS14 analog cells, a servo controller, and other circuits.
Born in Pinetown, Natal, South Africa, Derek served in the
South African military in 1975. Married, he is a triathlete
and has been a drummer since the age of 9.

Charles E. Moore 

A technical contributor at HP's Integrated Circuit Business
Division, Charles Moore is currently the technical lead on an
optical position encoder chip and is doing consultations on
another chip with optical content. He has worked with HP for
thirty years and is named as an inventor in over fifteen
patents on lens design, IC design, and the system design of
instruments. He has authored several articles in the HP
Journal on his work. He is a member of the Optical Society of
America and is a past president of the Rocky Mountain chapter.
He is professionally interested in analog IC design, optics,
and system design. He received a BSEE degree from the
University of California at Berkeley in 1966 and an MS degree
in optics from the University of Rochester in New York in
1978. His HP projects include working as a product and process
engineer on a silicon thermal rms converter and working on the
optical, system, and receiver electronics design for the HP
3820 surveying total station. Born in Santa Fe, New Mexico, he
served in the U.S. Army from 1960 to 1963. Charles is married
and has five children and one grandchild. He has done
volunteer work with the Democratic party and his hobbies
include playing chess, directing chess tournaments, and
studying the history of technology.

120 Heat Transfer from a Flip-Chip 
Package 

Cullen E. Bash 

An engineer at HP's Network Server Division, Cullen Bash is
currently working on the thermal and mechanical design of the
HP NetServer line of network servers. Cullen received BS and
MS degrees in mechanical engineering in 1994 and 1995,
respectively, from the University of California at San Diego.
He then joined HP's Systems Technology Division. For the HP
9000 Model T600 corporate business server, he was responsible
for the thermal design of the memory board assembly and for
the thermal and mechanical design of the I/O assembly (HSC bus
converter).

Richard L. Blanco

Rich Blanco is an R&D project manager at HP's Systems
Technology Division. He joined HP in 1984 after receiving a BS
degree in fluids and thermal science from Case Western Reserve
University. He earned an MS degree in mechanical engineering
from Stanford University. Rich is married and has two
children. In his free time, he enjoys spending time with his
family, especially playing soccer and going camping.
Reader Forum



The HP Journal encourages technical discussion of the topics presented in recent articles
and will publish letters expected to be of interest to our readers. Letters must be brief and
are subject to editing. Letters should be addressed to:

Editor, Hewlett-Packard Journal
3000 Hanover Street, 20BH
Palo Alto, CA 94304, USA.

Editor: 

In the recent article entitled "The Global Positioning System
and HP SmartClock" by John A. Kusters, which appeared in
the December 1996 issue of the Hewlett-Packard Journal, the
relationship of Global Positioning System (GPS) time to
Coordinated Universal Time (UTC) was inadequately explained.

1. UTC is a time scale maintained officially by the BIPM
(Bureau International des Poids et Mesures) using clocks
from a number of laboratories around the world. UTC is
available to users with a delay of up to two months
because of the need to analyze the contributed clock data
very carefully. It is adjusted occasionally by one second
so that the absolute value of the difference between UTC
and the astronomical time scale (UT1) does not exceed
0.9 second. These one-second adjustments, called leap
seconds, are not made in GPS time. As a result, GPS time
will be different from UTC by an integral number of
seconds plus a small synchronization error. As of July 1, 1997,
the difference is essentially 12 seconds.

2. Similarly, the U.S. Naval Observatory (USNO) maintains a
time scale that, by international agreement, is steered to
be near UTC. Approximately forty HP 5071A primary
frequency standards and ten hydrogen masers are used to
accomplish this. The USNO master clock provides a
real-time realization, UTC(USNO MC), of the calculated USNO
time scale. Over the past year, the difference between
UTC and UTC(USNO MC) has not exceeded thirty
nanoseconds.

3. The timing of the GPS system (GPS time) is maintained
by the GPS Master Control Station (MCS) at Falcon AFB,
Colorado Springs, CO, using observed time differences
between the USNO Master Clock and the GPS time
broadcast by the satellites in the GPS system. This data is
collected by the Naval Observatory and made available to
the MCS daily. Using this information, GPS time is steered
toward UTC(USNO MC) by MCS personnel.

4. The GPS system not only maintains GPS time, it also
provides information in the navigation message broadcast by
each satellite to enable the user to extract a representation
of UTC(USNO MC). Currently the rms difference between
UTC(USNO MC) and the UTC available from the satellites
is in the neighborhood of 8 nanoseconds. By agreement,
GPS time is to be maintained within 1 microsecond of
UTC(USNO MC) absent the integral seconds, and the
representation of UTC(USNO MC) available from the GPS
system is to be kept accurate within 100 nanoseconds.

5. To satisfy the many users of precise time and frequency,
the Naval Observatory continues to maintain its time
scales as close as possible to that provided by BIPM.
These users include (but are not limited to) users of the
GPS system. No plans to change the agreements
mentioned above are in place or contemplated.
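
As an illustration of the relationships described in the points above, the following Python sketch converts a GPS time stamp to UTC by removing the integral-second offset and checks two measured differences against the agreed tolerances. It is only a sketch under stated assumptions: the function and constant names are hypothetical, the 12-second offset is the value quoted in this letter for July 1, 1997 (it changes whenever a leap second is added), and the sign convention assumes GPS time runs ahead of UTC.

```python
from datetime import datetime, timedelta

# Assumption: integral-second offset (GPS time minus UTC) quoted in the
# letter as of July 1, 1997. It grows by one at each later leap second.
GPS_MINUS_UTC_SECONDS = 12

# Tolerances stated in the letter, expressed in nanoseconds.
GPS_TIME_TOLERANCE_NS = 1000      # GPS time vs. UTC(USNO MC), less integral seconds
BROADCAST_UTC_TOLERANCE_NS = 100  # broadcast UTC representation vs. UTC(USNO MC)


def gps_to_utc(gps_time, offset_seconds=GPS_MINUS_UTC_SECONDS):
    """Remove the integral-second offset from a GPS time stamp.

    The small synchronization error (tens of nanoseconds) is ignored here.
    """
    return gps_time - timedelta(seconds=offset_seconds)


def within_agreement(gps_error_ns, broadcast_utc_error_ns):
    """Return True if both measured differences meet the agreed limits."""
    return (abs(gps_error_ns) <= GPS_TIME_TOLERANCE_NS
            and abs(broadcast_utc_error_ns) <= BROADCAST_UTC_TOLERANCE_NS)


if __name__ == "__main__":
    gps_stamp = datetime(1997, 8, 1, 0, 0, 12)
    print("GPS time:", gps_stamp)
    print("UTC     :", gps_to_utc(gps_stamp))
    print("Within agreement:", within_agreement(500.0, 8.0))
```

A receiver would normally take the current offset from the navigation message rather than hard-coding it; the constant is fixed here only to mirror the figure given in the letter.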

Dennis D. McCarthy
Director, Directorate of Time
U.S. Naval Observatory

John A. Kusters
Principal Scientist
Hewlett-Packard Company




5965-5918E 






